The utility of limited feedback for coding over an individual sequence of DMCs is investigated. This study 
complements recent results showing how limited or noisy feedback can boost the reliability of communication. A 
strategy with fixed input distribution P is given that asymptotically achieves rates arbitrarily close to the mutual 
information induced by P and the state-averaged channel. When the capacity achieving input distribution is the same 
over all channel states, this achieves rates at least as large as the capacity of the state averaged channel, sometimes 
called the empirical capacity. 

I. Introduction 



Many contemporary communication systems can be modeled via a time-varying state. For example, in wireless 
communications, the channel variation may be caused by neighboring systems, mobility, or other factors that are 
difficult to model. In order to design robust communication strategies, engineers should adopt an appropriate model 
that can capture the channel dynamics. One such model is the so-called arbitrarily varying channel (AVC), where the 
state can depend on the communication strategy and is selected in the worst possible manner. One interpretation of 
this model is that there is a fixed rate that one wants to support over the worst possible channel states. An alternative 
and perhaps more relevant approach is an individual sequence model, where the state is fixed but unknown and not 
dependent on the communication strategy. Here, a natural requirement is for a strategy to perform well whenever the 
state sequence is favorable, while for less favorable state sequences, inferior performance is acceptable. Essentially, 
this model considers the case in which one wants to adapt the rate to what the specific state sequence can support. 
In order to achieve this variation in performance, the encoder must obtain some measure of the quality of the state 
sequence. This requires additional resources, and the most natural model is to introduce feedback from the receiver 
to the transmitter. A second resource is joint randomization between the encoder and the decoder, which can also 
be enabled via feedback. The encoder can use feedback to estimate the channel quality and hence communicate 
at rates commensurate with the channel quality. Two fundamental questions are the following: first, how good a 
performance (in terms of achievable rate) can one expect for favorable state sequences? Second, how much feedback 
is required to attain this performance? Many of the works in this area can be understood in terms of how they 
answer these two questions. 

The main trade-off for the channel model at hand is the correct balance between the resources spent on 
communication versus those spent on channel estimation. This trade-off is well understood in the case where 
the channel state sequence is fully revealed to the receiver, as shown in the work of Draper et al. [2]. Regarding 
the first question, for any fixed input distribution, their scheme can achieve rates arbitrarily close to the mutual 
information of the channel with the state known to both the transmitter and receiver. They also provide an interesting 
answer to the second question: a feedback link of vanishing rate is sufficient to attain this performance. To sum 
up, when channel estimation at the receiver is free, feedback of vanishing rate is enough. 

Fig. 1. Model setup with limited feedback and common randomness. 

Table I. Related results and assumptions on channel model, feedback, state information and common randomness. 

Shayevitz and Feder [3] consider the more realistic case where the decoder has only the channel outputs. They 
develop a scheme in which the receiver keeps estimating the state sequence. In their consideration, the transmitter 
has full (causal) output feedback and can thus also track the state sequence. For the class of channels they consider, 
Shayevitz and Feder establish an achievable rate that they call the "empirical capacity," which they define as 
the capacity of an i.i.d. channel with transition probabilities corresponding to the empirical statistics of the noise 
sequence. Therefore if feedback is free, then rates arbitrarily close to the "empirical capacity" are achievable. 

This paper is a commentary on this development: we consider the same notion of "empirical capacity," but provide 
an answer to the second question. Specifically, for a fixed input distribution, we show that if common randomness is 
available, a feedback link of vanishing rate is sufficient to achieve the empirical mutual information, which in some 
settings, such as the class of channels considered by Shayevitz and Feder, coincides with the "empirical capacity". 
To do this, we adapt the feedback-reducing block/chunk strategies used earlier in the context of reliability functions 
[4], [5], and most specifically in [6]. They are in turn inspired by Hybrid ARQ [7]. Thus, the flavor of our algorithm 
is different from [3]. By doing away with the output feedback, we lose the simplicity of the scheme in [3], but we 
show that similar rates can still be obtained with almost negligible feedback. 

The strategy developed in this paper fits in the category of rateless codes, which are a class of coding strategies 
that use limited feedback to adapt to unknown channel parameters. Most studies about feedback for rate and 
reliability have centered around full output feedback [4], [8]-[14]; however, recent work has started to improve our 
understanding of how limited feedback affects these performance measures. For instance, limited feedback can be 
used to improve reliability [6]. Furthermore, in some multiuser Gaussian channels, noisy feedback increases the 
achievable rates [15], [16] and the reliability [5], [17]. In a rateless code the decoder can use a low-rate feedback 
link to inform the encoder when it decodes. These codes were first studied in the context of the erasure channel 
[18], [19]. Later work focused on compound channels [20]-[22]. The work of Draper et al. [2] is to our knowledge 
the first step towards adapting rateless codes to time-varying states. 

We are now in a position to compare the modeling assumptions in these previous works with the current 
investigation; the comparisons are summarized in Table I. The initial studies of rateless coding by Shulman [20] 
and Tchamkerten and Telatar [22] used feedback to tune the rate to the realized parameter governing the channel 
behavior. The study of time-varying states was first introduced by Draper et al. [2], but they assumed full state 
information at the decoder, which leads to higher rates. Most recently, Shayevitz and Feder [3] showed an explicit 
coding algorithm based on Horstein's method [8] that achieves the empirical capacity. Their scheme uses full 
feedback, but in turn works for a larger class of channel models. Moreover, it is a horizon-free scheme. 

In our scheme, the encoder attempts to send k bits over the channel during a variable-length round. The encoder 
sends chunks of the codeword to the decoder, after which the decoder feeds back a decision as to whether it can 
decode. The encoder and decoder use common randomness to select a set of random training positions 
during which the encoder sends a pilot sequence. The decoder uses the training positions to estimate the channel. As 
soon as the total empirical mutual information over the aggregate channel sufficiently exceeds k bits, the decoder 
attempts to decode. Through this combination of training-based channel estimation and robust decoding we can 
exploit the limited feedback to achieve rates asymptotically equal to those with advance knowledge of the average 
channel. 

In the next section, we motivate the study of this problem with some concrete examples. In Section III we define 
the channel model, state our main result, and describe the coding strategy. Section IV contains the analysis of our 
strategy, with most of the technical details reserved for the Appendix. 



II. Motivating Examples 

The following two simple examples will prove useful in explaining the meaning of the main result of this paper, 
and help motivate the present study. The first is the model considered in [3]: a binary modulo-additive channel 
with a noise sequence whose empirical frequency of 1's is unknown. In this example, the "empirical mutual 
information" under all state sequences is maximized by the uniform distribution, so our algorithm achieves the 
"empirical capacity". In the second example we consider the Z-channel, for which the input distribution maximizing 
the empirical mutual information is not identical for all state sequences, so our scheme will not in general achieve 
rates as high as the empirical capacity. 

A. Binary modulo-additive channels 

The simplest example of a channel with an individual noise sequence is the binary modulo-additive channel. This 
channel takes binary inputs and produces binary outputs, where the output is produced by flipping some bits of the 
channel input. These flips do not depend on the channel input symbols. The output $\mathbf{y} \in \{0,1\}^N$ can be written as 

$$ \mathbf{y} = \mathbf{x} \oplus \mathbf{z} , \tag{1} $$

where $\mathbf{x} \in \{0,1\}^N$ is the channel input, $\mathbf{z} \in \{0,1\}^N$ is the noise sequence, and addition is carried out modulo-2. 
The noise $\mathbf{z}$ is arbitrary but fixed, and we let $p \in [0,1]$ be the empirical fraction of 1's in $\mathbf{z}$, which is 
likewise arbitrary but fixed. 

Because the state sequence $\mathbf{z}$ is arbitrary and unknown, it is not clear how to find the highest possible rate of 
reliable communication. For any fixed $\mathbf{z}$, we could say naively that the capacity is one bit, because the channel is 
deterministic. However, $\mathbf{z}$ is unknown and may, in fact, have been generated i.i.d. according to a Bernoulli distribution 
with parameter p, in which case the capacity should be no larger than $1 - h(p)$, namely, the capacity of a binary 
symmetric channel (BSC) with crossover p. The algorithm in this paper guarantees a rate close to $1 - h(p)$ for any 
state sequence $\mathbf{z}$ with an empirical fraction of 1's equal to p. This rate can be thought of as the empirical mutual 
information of the channel with input distribution (1/2, 1/2). Since the capacity-achieving input distribution is the same 
for all BSCs, the rate can also be called the empirical capacity, as in the work of Shayevitz and Feder [3]. 

B. Z-channels with unknown crossover 

Whereas the example above can be thought of as an XOR operation with the channel state, in our second example, 
we consider a binary channel in which the output is the logical OR of the input and state. For input $\mathbf{x}$ and noise 
$\mathbf{z}$, the output is given by the following: 

$$ \mathbf{y} = \mathbf{x} \vee \mathbf{z} . \tag{2} $$

Again, the noise sequence $\mathbf{z}$ is arbitrary but fixed. Let q denote the empirical fraction of 1's in $\mathbf{z}$. 

The algorithm in this paper achieves rates close to those corresponding to a Z-channel with crossover probability 
q. This channel is the average $W_{\mathbf{z}}$ of $W(y|x, z_i)$ over $\mathbf{z}$. Unlike the previous example, this channel has a 
capacity-achieving input distribution that depends on q. The algorithm proposed in this paper chooses a fixed input distribution 
P and achieves the mutual information $I(P, W_{\mathbf{z}})$ of a Z-channel with that input distribution. This leaves open the 
question of how to choose P. One method is to choose the P that minimizes the gap $\max_Q I(Q, W_{\mathbf{z}}) - I(P, W_{\mathbf{z}})$ 
over all $\mathbf{z}$. However, in many cases the uniform distribution is not a bad choice, as shown by Shulman 
and Feder [23]. In our results we leave the choice of P open for the designer. 

III. The channel model and coding strategy 

A. Notation 

Script letters will generally be used to denote sets and alphabets and boldface to denote vectors. For a vector 
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$, we write $x_i^j$ for the tuple $(x_i, x_{i+1}, \ldots, x_j)$ and $x^j$ for the tuple $(x_1, x_2, \ldots, x_j)$. The notation 
$[J]$ will be used as shorthand for the set $\{1, 2, \ldots, J\}$. The probability distribution $T_{\mathbf{z}}$ is the type of a sequence $\mathbf{z}$. 
For a distribution $Q$, the set $T_N(Q)$ is the set of all length-$N$ sequences of type $Q$. 



B. Channel model and coding 

The problem we consider in this paper is that of communicating over a channel with an individual state sequence. 
Let the finite sets $\mathcal{X}$ and $\mathcal{Y}$ denote the channel input and output alphabets, respectively. The channel model we 
consider consists of a family of channels $\mathcal{W} = \{W(y|x, z) : z \in \mathcal{Z}\}$ indexed by a state variable in a finite set $\mathcal{Z}$. 
For any state sequence $\mathbf{z} = (z_1, z_2, \ldots, z_N)$ and output $y_i$, we assume 

$$ \mathbb{P}(y_i \mid x^i, y^{i-1}, \mathbf{z}) = W(y_i \mid x_i, z_i) . $$

That is, the channel output depends only on the current input and state. 

We consider coding for this channel using the setup shown in Figure 1. We think of the rate-limited feedback 
link as a noiseless channel that can be used every $n_{\mathrm{fb}}$ uses of the forward channel to send $B_{\mathrm{fb}}$ bits. The rate of 
the feedback is thus $R_{\mathrm{fb}} = B_{\mathrm{fb}}/n_{\mathrm{fb}}$. To avoid integer effects, we will consider only integer values for $n_{\mathrm{fb}}$ and $B_{\mathrm{fb}}$. 
We assume that the encoder and decoder have access to a common random variable G distributed uniformly over 
the unit interval [0, 1]. This random variable can be used to generate common randomness that is shared between 
the encoder and decoder. 

Because the maximum capacity of this set of channels is $C_{\max} = \log \min\{|\mathcal{X}|, |\mathcal{Y}|\}$, we define the set of possible 
messages to be the set of all binary sequences $\{0,1\}^{N C_{\max}}$. This message set is naturally nested: the truncated 
set $\{0,1\}^T$ is a set of prefixes for $\{0,1\}^{N C_{\max}}$. At the time of decoding, the decoder will decide on a decoding 
threshold $T \in \mathbb{N}$ and a message $m \in \{0,1\}^T$. The threshold T is itself a random variable that will depend on the 
state sequence $\mathbf{z}$, the common randomness G, and the randomness in the channel. 

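An $(N, n_{\mathrm{fb}}, B_{\mathrm{fb}})$ coding strategy for blocklength N consists of a sequence of (possibly random) encoding 
functions for $i = 1, 2, \ldots, N$, 

$$ \eta_i : \{0,1\}^{N C_{\max}} \times \{0,1\}^{\lfloor (i-1)/n_{\mathrm{fb}} \rfloor B_{\mathrm{fb}}} \times [0,1] \to \mathcal{X} , \tag{3} $$

a sequence of (possibly random) feedback functions for $i = n_{\mathrm{fb}}, 2n_{\mathrm{fb}}, \ldots$: 

$$ \phi_i : \mathcal{Y}^i \times [0,1] \to \{0,1\}^{B_{\mathrm{fb}}} , \tag{4} $$

and a decoding function 

$$ \psi : \mathcal{Y}^N \times [0,1] \to \{0, 1, \ldots, N C_{\max}\} \times \{0,1\}^{N C_{\max}} . \tag{5} $$

We say a message $m \in \{0,1\}^{N C_{\max}}$ is encoded into a codeword $\mathbf{x} \in \mathcal{X}^N$ if 

$$ x_i = \eta_i\big(m,\ \phi_{n_{\mathrm{fb}}}(y^{n_{\mathrm{fb}}}, G),\ \ldots,\ \phi_{\lfloor (i-1)/n_{\mathrm{fb}} \rfloor n_{\mathrm{fb}}}(y^{\lfloor (i-1)/n_{\mathrm{fb}} \rfloor n_{\mathrm{fb}}}, G),\ G\big) \quad \forall i \in [N] . \tag{6} $$

For an $(N, n_{\mathrm{fb}}, B_{\mathrm{fb}})$ coding strategy, let $\psi(\mathbf{y}, G) = (T, \hat{m})$. The first output $T \in \{0, 1, \ldots, N C_{\max}\}$ is the decoding 
threshold and $\hat{m}^T$ is the message estimate. Both of these quantities are random variables. 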

For a state sequence $\mathbf{z}$, the maximal error probability of an $(N, n_{\mathrm{fb}}, B_{\mathrm{fb}})$ coding strategy is defined as 

$$ \epsilon_{\mathrm{dec}}(\mathbf{z}) = \max_{m \in \{0,1\}^{N C_{\max}}} \mathbb{P}_{G,W}\left( \hat{m}^T \neq m^T \mid \mathbf{z}, m \right) , \tag{7} $$

where the probability is taken over the common randomness G and randomness in the channel. For a state sequence 
$\mathbf{z}$, a rate R is said to be achievable with probability $1 - \epsilon_{\mathrm{ach}}(\mathbf{z})$ if 

$$ \epsilon_{\mathrm{ach}}(\mathbf{z}) = \max_{m \in \{0,1\}^{N C_{\max}}} \mathbb{P}_{G,W}\left( R > T/N,\ \hat{m}^T \neq m^T \mid \mathbf{z}, m \right) . \tag{8} $$

Note that we can upper bound $\epsilon_{\mathrm{ach}}(\mathbf{z})$: 

$$ \epsilon_{\mathrm{ach}}(\mathbf{z}) \le \epsilon_{\mathrm{dec}}(\mathbf{z}) + \max_{m} \mathbb{P}_{G,W}\left( R > T/N \mid \mathbf{z}, m \right) . \tag{9} $$

Note that this channel model assumes a known finite horizon N, unlike the infinite horizon model of Shayevitz 
and Feder [3]. Furthermore, the basic model assumes an unbounded amount of common randomness in the form 
of the real number G. This point is discussed further in Section V. 



C. Mutual information definitions 

The results in this paper are stated in terms of mutual information quantities involving time-averaged channels 
dependent on the individual state sequence $\mathbf{z}$. For fixed $\mathbf{z}$ define the state-averaged channel to be 

$$ W_{\mathbf{z}}(y|x) = \frac{1}{N} \sum_{i=1}^{N} W(y|x, z_i) . \tag{10} $$

Note that if $\mathbf{z}$ and $\mathbf{z}'$ have the same type, then the state-averaged channels generated by them are the same. Define 
the empirical channel for a distribution Q on $\mathcal{Z}$: 

$$ W_Q(y|x) = \sum_{z \in \mathcal{Z}} W(y|x, z) Q(z) . \tag{11} $$

For a fixed input distribution P(x) on $\mathcal{X}$ and channel W(y|x), the mutual information is given by the usual 
definition: 

$$ I(P, W) = \sum_{x, y} W(y|x) P(x) \log \frac{W(y|x)}{\sum_{x'} W(y|x') P(x')} . $$

For an individual state sequence $\mathbf{z}$ the empirical mutual information is given by $I(P, W_{\mathbf{z}})$. 
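
As a small numerical illustration of these definitions, the following Python sketch computes a state-averaged channel and the corresponding empirical mutual information. The two-state channel family, the state sequences, and all function names are hypothetical choices made for this illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def state_averaged_channel(W, z):
    """Return W_z[x][y] = (1/N) * sum_i W(y | x, z_i).

    W[s][x][y] holds W(y | x, s) for each state s; z is the state sequence.
    """
    counts = Counter(z)                 # only the type of z enters the average
    n = len(z)
    some_state = next(iter(W))
    n_in, n_out = len(W[some_state]), len(W[some_state][0])
    return [[sum(c * W[s][x][y] for s, c in counts.items()) / n
             for y in range(n_out)]
            for x in range(n_in)]

def mutual_information(P, V):
    """I(P, V) in bits, with P[x] the input distribution and V[x][y] = V(y|x)."""
    n_out = len(V[0])
    p_out = [sum(P[x] * V[x][y] for x in range(len(P))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(P):
        for y in range(n_out):
            if px > 0 and V[x][y] > 0:
                total += px * V[x][y] * math.log2(V[x][y] / p_out[y])
    return total

# Hypothetical family: state 0 is a clean binary channel, state 1 forces the output to 1.
W = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[0.0, 1.0], [0.0, 1.0]]}
P = [0.5, 0.5]
z_a = [0, 0, 0, 1, 1, 0, 0, 0]
z_b = [1, 0, 0, 0, 0, 0, 1, 0]          # same type as z_a, different ordering
W_a = state_averaged_channel(W, z_a)
W_b = state_averaged_channel(W, z_b)
print(W_a == W_b)                        # sequences of the same type give the same channel
print(mutual_information(P, W_a))        # empirical mutual information I(P, W_z)
```

Because the average depends on the state sequence only through its type, the two orderings produce the same averaged channel and hence the same empirical mutual information.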



D. Optimality versus empirical capacity 

We are interested in analyzing strategies that can adapt their rates depending on the state sequence, and in our 
analysis, we want to consider the rates achieved by a strategy as a function of the state sequence. Unlike the 
compound channel setting (see e.g. [24] for definitions), which considers the worst-case behavior of a strategy over 
a class of channels, we instead want strategies that perform universally well over all sequences. However, this raises 
the problem of finding a notion of optimality that does not depend on the worst-case performance. 

One possibility is to define an optimal strategy as one that, for every state sequence, achieves a rate at least as 
large as any other strategy for that sequence, and then define the capacity as the rates achieved by this strategy. 
However, this means comparing a strategy for all sequences against all strategies tailored to a fixed sequence. In 
the example in Section II-A, for each z there exists a decoding strategy which adds z to the output, undoing all of 
the bit flips. Each strategy achieves rate 1 for the specific choice of z, but this is clearly an unreasonable target. 

Instead, for each sequence we can consider a set of reference strategies and measure the "regret" of our strategy 
with respect to the reference strategies for each sequence. We take an approach inspired by source coding for 
individual sequences, in which we have a benchmark rate for each state sequence and then test whether a coding 
strategy attains the benchmark for each state sequence. 

One such benchmark that we consider in this paper is the empirical capacity: for a fixed $\mathbf{z}$, the empirical 
capacity is defined as the supremum over all input distributions of the empirical mutual information: 

$$ C(\mathbf{z}) = \sup_{P(x)} I(P, W_{\mathbf{z}}) . $$

First used by Shayevitz and Feder [3], empirical capacity is given its name not because it is purported to be optimal, 
but instead because of its resemblance to the capacity of a point-to-point discrete memoryless channel. 

There are two points that are worth mentioning before proceeding to describe the results in this paper. First, it 
is easy to see that the empirical capacity is a weaker target than the best possible strategy for a given sequence. It 
is possible that a strategy can achieve rates larger than the empirical capacity. In the example in Section II-A, if 
the sequence z were all 0 for the first half and all 1 for the second half, the empirical capacity is 0, whereas the 
coding strategy presented in this paper is expected to achieve rates close to 1. 

Second, there may exist examples for which no strategy is guaranteed to achieve the empirical capacity. The 
coding strategy proposed in this paper uses a fixed input distribution P, and in general, the maximizing P(x) may 
not be the same for all $\mathbf{z}$.¹ In these cases our strategy can achieve rates close to the empirical mutual information 
$I(P, W_{\mathbf{z}})$ but not the empirical capacity $C(\mathbf{z})$. It may be possible to adapt P over time, and finding a strategy 
achieving C(z) or a counterexample showing that for some channels, no strategy achieving C(z) is possible, is 
left for future research. 



E. Main result 

The main result in this paper is that the algorithm given in the next section achieves rates that asymptotically 
approach the mutual information I (P, W z ) for a large set of state sequences z. 

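Theorem 1: Let $\{W(y|x, z) : z \in \mathcal{Z}\}$ be a given family of channels. Then given any $\rho > 0$, $\epsilon > 0$, $\lambda^* > 0$, and 
channel input distribution P, there exists an N sufficiently large and an $(N, n_{\mathrm{fb}}, B_{\mathrm{fb}})$ coding strategy with feedback 
rate 

$$ R_{\mathrm{fb}} = \frac{B_{\mathrm{fb}}}{n_{\mathrm{fb}}} < \lambda^* , \tag{13} $$

such that for all $\mathbf{z} \in T_N(Q)$, the rate 

$$ R \ge I(P, W_Q) - \rho \tag{14} $$

is achievable with probability $1 - \epsilon$. 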

Binary modulo-additive channels, revisited: For the binary additive example in Section II-A, p denoted the fraction 
of ones in the noise sequence $\mathbf{z}$. Then, the empirical capacity is $1 - h(p)$, the capacity of the binary symmetric 
channel with crossover probability p. Theorem 1 implies the existence of strategies employing asymptotically 
zero-rate feedback such that for all $\rho, \epsilon > 0$ and sufficiently large N, 

$$ R \ge 1 - h(p) - \rho \tag{15} $$

is achievable with probability at least $1 - \epsilon$. 

Z-channels with unknown crossover, revisited: For the example in Section II-B, with q equal to the fraction of 
1's in the crossover sequence, the capacity-achieving input distribution is a function of q, so the theorem cannot 
guarantee a scheme achieving the empirical capacity. Despite this, it still provides achievable rates in this setting. 
If the channel input distribution has $P(X = 0) = p_x$ for this channel, then the empirical mutual information for 
this channel can be written as 

$$ I(P, W_q) = h(p_x) - (1 - p_x + p_x q)\, h\!\left( \frac{p_x q}{1 - p_x + p_x q} \right) , \tag{16} $$

and is asymptotically achievable from Theorem 1. As discussed briefly at the end of Section III-C, the question of 
how to select $p_x$ is outside the framework of this paper; one possibility is given in equation (12). 
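
The two revisited examples can be checked numerically. The Python sketch below evaluates the mutual information of the state-averaged XOR and OR channels directly from their transition matrices, for a uniform input and for the best binary input found by a grid search. The function names and the grid search are assumptions of this sketch, not part of the paper's scheme.

```python
import math

def mutual_information(p_in, channel):
    """I(P, W) in bits; p_in[x] = P(x) and channel[x][y] = W(y|x)."""
    n_out = len(channel[0])
    p_out = [sum(p_in[x] * channel[x][y] for x in range(len(p_in)))
             for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_in):
        for y in range(n_out):
            w = channel[x][y]
            if px > 0 and w > 0:
                total += px * w * math.log2(w / p_out[y])
    return total

def averaged_xor_channel(p):
    """State-averaged channel for y = x XOR z when a fraction p of the z_i equal 1."""
    return [[1 - p, p], [p, 1 - p]]

def averaged_or_channel(q):
    """State-averaged channel for y = x OR z when a fraction q of the z_i equal 1."""
    return [[1 - q, q], [0.0, 1.0]]

def empirical_capacity(channel, grid=1000):
    """Grid-search approximation of max_P I(P, W) over binary input distributions."""
    return max(mutual_information([1 - j / grid, j / grid], channel)
               for j in range(grid + 1))

uniform = [0.5, 0.5]
for frac in (0.1, 0.3, 0.5):
    for name, ch in (("XOR", averaged_xor_channel(frac)),
                     ("OR", averaged_or_channel(frac))):
        fixed = mutual_information(uniform, ch)   # rate target for the fixed uniform P
        best = empirical_capacity(ch)             # approximate empirical capacity
        print(f"{name} fraction={frac}: I(uniform)={fixed:.4f}, capacity~{best:.4f}")
```

For the XOR channel the uniform input attains the maximum for every fraction of ones, whereas for the OR channel the maximizing input shifts with the fraction of ones, so a fixed input distribution leaves a gap to the empirical capacity.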

¹A question then arises of how one chooses the input distribution P. One possibility could be to choose P to be uniform over the input 
alphabet. However, depending on the setting, other approaches might be preferable. Inspired by the theory of AVCs, one may choose the 
input distribution to be 

$$ P = \arg\max_{P'} \inf_{Q : I(P', W_Q) > \rho} I(P', W_Q) , \tag{12} $$

where $\rho$ is a parameter governing the gap between the rates guaranteed by the algorithm and the empirical mutual information of the channel. 
This approach can run into problems in some situations in which, for the P chosen, $I(P, W_Q) = 0$ for a large subset of state distributions 
Q, but there exists a distribution $\tilde{P}$ for which $I(\tilde{P}, W_Q) > \rho$ for all Q. On the other hand, if one were to remove the condition that 
$I(P', W_Q) > \rho$, then for the example in Section II-A, $\inf_Q I(P', W_Q) = 0$ for all choices of $P'$, and the choice of $P'$ would be arbitrary. 
Because of such issues, we will leave the question of how to choose the input distribution P unanswered in this work. The problem of 
choosing P is similar to that studied by Shulman and Feder [23]. 

F. Proposed coding strategy: Randomized rateless code 

The achievability result in Theorem 1 relies on the following coding strategy, which can be thought of as iterated 
rateless coding with randomized training (or, for short, randomized rateless code). The overall scheme is illustrated 
in Figure 2. The scheme divides time into chunks of b(N) channel uses and in each round attempts to send k(N) bits 
using a randomized rateless code. Each chunk contains randomly placed training sequences so the decoder can 
estimate the empirical channel. The decoder chooses to decode when the empirical rate falls below the estimated 
empirical mutual information calculated from the channel estimates. After the k(N) bits are decoded the round ends 
and the encoder starts a new round to send the next k(N) bits. The length of each round is variable and depends 
on the empirical state sequence. 

We now describe each component of the scheme in more detail. 

1) Feedback: Divide the blocklength N into chunks of length b = b(N). Feedback occurs at the end of chunks, 
so $n_{\mathrm{fb}} = b$, with three possible messages: "BAD NOISE," "DECODED," and "KEEP GOING," which correspond 
to the feedback messages 00, 01, and 10, respectively. Thus, $B_{\mathrm{fb}} = 2$, so the feedback rate $R_{\mathrm{fb}} = \lambda(N)$ is given 
by the expression 

$$ \lambda(N) = \frac{2}{b(N)} . \tag{17} $$

If the chunk size b(N) goes to infinity as $N \to \infty$, the feedback rate $\lambda(N) \to 0$. 

2) Rateless coding: A rateless code is a variable-length coding scheme to send a fixed number of bits. In the 
algorithm proposed here, the encoder attempts to send k = k(N) bits over several chunks comprising a round. 
Rounds vary in length and terminate at the end of chunks in which the decoder feeds back either "BAD NOISE" 
or "DECODED." Let $\ell_r$ denote the time index at the end of round r: 

$$ \ell_r = \min\left\{ j = i \cdot b(N) > \ell_{r-1} : \phi_j = \text{"BAD NOISE" or "DECODED"} \right\} , \tag{18} $$

and set $\ell_0 = 0$. 

An $(M^*, c, k)$ rateless code is a sequence of maps $\{(\mu_i, \nu_i) : i = 1, 2, \ldots, M^*\}$, where 

$$ \mu_i : \{0,1\}^k \to \mathcal{X}^c \tag{19} $$
$$ \nu_i : \mathcal{Y}^{ic} \to \{0,1\}^k . \tag{20} $$

The encoding maps $\mu_i$ produce successive chunks of a codeword for a given message, and the decoding maps $\nu_i$ attempt 
to decode the message based on the channel outputs. An $(M^*, c, k)$ randomized rateless code is a random variable 
that takes values in the set of $(M^*, c, k)$ rateless codes. The maximal error probability $\hat{\epsilon}(M, \mathbf{z}) = \hat{\epsilon}(M, \mathbf{z}, \mathcal{D})$ for 
a randomized rateless code $\mathcal{D}$ decoded at time Mc with state sequence $\mathbf{z} \in \mathcal{Z}^{Mc}$ is 

$$ \hat{\epsilon}(M, \mathbf{z}, \mathcal{D}) = \max_{m \in \{0,1\}^k} \mathbb{E}\left[ W^{Mc}\left( \nu_M(y_1^{Mc}) \neq m \mid \mu_1^M(m), \mathbf{z} \right) \right] \tag{21} $$
$$ = \max_{m \in \{0,1\}^k} \hat{\epsilon}_m(M, \mathbf{z}, \mathcal{D}) , \tag{22} $$

where the expectation is taken over the randomness in the code. We will suppress dependence on $\mathcal{D}$ when it is clear 
from context. The randomized rateless code used in this paper has codewords with constant composition P(x) on 
$\mathcal{X}$ and uses a maximum mutual information (MMI) decoder. 
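
To make the MMI decoding rule concrete, here is a minimal Python sketch: the decoder picks the message whose codeword has the largest empirical mutual information (computed from the joint type) with the received sequence. The three-codeword codebook is a hypothetical toy with constant composition; it is not the randomized code analyzed in the paper.

```python
import math
from collections import Counter

def empirical_mutual_information(x, y):
    """Mutual information (bits) of the joint type of the sequences x and y."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # sum over observed pairs of p(a,b) * log2( p(a,b) / (p(a) p(b)) )
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mmi_decode(codebook, y):
    """Return the message whose codeword maximizes the empirical mutual information with y."""
    return max(codebook, key=lambda m: empirical_mutual_information(codebook[m], y))

# Toy codebook: every codeword has four 0s and four 1s (constant composition).
codebook = {
    "00": (0, 0, 0, 0, 1, 1, 1, 1),
    "01": (0, 0, 1, 1, 0, 0, 1, 1),
    "10": (0, 1, 0, 1, 0, 1, 0, 1),
}
received = (1, 0, 1, 1, 0, 0, 1, 1)   # codeword "01" with its first symbol flipped
print(mmi_decode(codebook, received))
```

The decoder needs no knowledge of the channel law: it only compares joint types, which is what makes this rule suitable when the state sequence is unknown.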

3) Training : The coding strategy analyzed in this paper uses a randomized rateless code in conjunction with 
randomly located training symbols. The training allows the decoder to estimate the channel and choose an appropriate 
decoding time. For each chunk of b channel uses, the scheme uses t = t(N) positions for training. Using the common 
randomness G, the encoder and decoder select t training positions $T_{r,n}$ for the n-th chunk of round r.² Formally, 
$T_{r,n}$ is uniformly distributed over subsets of $\{\ell_{r-1} + (n-1)b + 1, \ldots, \ell_{r-1} + nb\}$ of cardinality t. This set is further 
randomly partitioned into $|\mathcal{X}|$ subsets $T_{r,n}(x)$ for $x \in \mathcal{X}$. 

²There is a slight abuse of notation with the type $T_N(Q)$, but the double subscript in $T_{r,n}$ should make the distinction unambiguous. 

4) Encoding: The encoder attempts to send a message $m \in \{0,1\}^{N C_{\max}}$ over several rounds. In each round it 
attempts to send a sub-message $m_r \in \{0,1\}^k$ consisting of k bits of m. The sub-message $m_1$ is the first k bits of 
m. If the round r - 1 ended with "BAD NOISE" then $m_r = m_{r-1}$, and if round r - 1 ended with "DECODED" 
then $m_r$ is the next k bits of the message m. 

The encoder and decoder share an $(M^*, b - t, k)$ randomized rateless code. Using the common randomness G, at 
the start of each round the encoder and decoder choose an $(M^*, b - t, k)$ rateless code $\{(\mu_j, \nu_j) : j = 1, 2, \ldots, M^*\}$ 
according to the distribution of this randomized code. Define the encoding map $\eta_i$ in the n-th chunk of the r-th 
round: 

$$ \eta_i(m_r, G) = x , \quad i \in T_{r,n}(x) \tag{23} $$
$$ \{\eta_i(m_r, G) : i \notin T_{r,n}\} = \mu_n(m_r) . \tag{24} $$

That is, the n-th chunk transmitted by the scheme is created by taking the length-(b - t) piece of the codeword $\mu_n(m_r)$ and 
inserting the t randomly chosen training positions, as illustrated in Figure 2. The dependence of $\eta_i$ on the feedback 
is suppressed here because a round r is terminated as soon as the feedback message is no longer "KEEP GOING." 
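
The construction of a chunk can be sketched in Python as follows. The sketch assumes that the common randomness G is represented by a shared integer seed, and the function name, argument names, and placeholder codeword symbols are hypothetical. Both the encoder and the decoder can run the same function to agree on the training positions.

```python
import random

def build_chunk(seed, round_start, n, b, t, alphabet, codeword_piece):
    """Return (training, symbols) for the n-th chunk of a round.

    round_start plays the role of the end time of the previous round, and
    time indices are 1-based. `seed` is a stand-in for the common randomness G.
    """
    if t % len(alphabet) != 0:
        raise ValueError("t must be a multiple of the input alphabet size")
    if len(codeword_piece) != b - t:
        raise ValueError("codeword piece must have length b - t")
    rng = random.Random(f"{seed}:{round_start}:{n}")   # same generator at both ends
    first = round_start + (n - 1) * b + 1
    chunk = range(first, first + b)
    chosen = rng.sample(chunk, t)        # uniform size-t subset, returned in random order
    size = t // len(alphabet)
    # Consecutive slices of a randomly ordered sample give a random partition.
    training = {x: set(chosen[j * size:(j + 1) * size]) for j, x in enumerate(alphabet)}
    symbol_at = {i: x for x, positions in training.items() for i in positions}
    piece = iter(codeword_piece)
    # Training positions carry the pilot symbol x; the rest carry the codeword piece in order.
    symbols = [symbol_at[i] if i in symbol_at else next(piece) for i in chunk]
    return training, symbols

training, symbols = build_chunk(seed=7, round_start=0, n=2, b=10, t=4,
                                alphabet=(0, 1), codeword_piece=list("abcdef"))
print(training)
print(symbols)
```

The letters in `codeword_piece` are placeholders chosen so that codeword symbols are easy to tell apart from the pilot symbols in the printed chunk.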



Fig. 2. After each chunk of length b feedback can be sent. Rounds end by decoding a message or declaring the noise to be bad. 
(Diagram labels: total, round, chunk; each chunk contains training and codeword positions.) 



5) Decoding: The decoder uses the training symbols $\{y_i : i \in T_{r,n}(x)\}$ to estimate the channel transition 
probabilities $W_{\mathbf{z}}(y|x)$ and thereby obtain an estimate of the empirical mutual information $I(P, W_{\mathbf{z}})$ during the 
chunk and over the round. If the estimated mutual information is too low, then it feeds back "BAD NOISE." If 
the estimated mutual information is above the empirical rate $k/(n(b - t)) + \epsilon_1$, then it decodes the code using the 
MMI decoder $\nu_n$ of the rateless code and feeds back "DECODED." Otherwise, it feeds back "KEEP GOING." The 
parameter $\epsilon_1$ ensures that with high probability the empirical rate is below the true empirical mutual information 
of the channel. 

6) Algorithm: The parameters of the algorithm are the chunk size b(N), training size t(N), number of bits 
per round k, and decoding thresholds $\epsilon_1$ and $\tau$. 

Given an $(M^*, b - t, k)$ randomized rateless code and message bits $m_r$, the encoder and decoder first use common 
randomness to choose a realization of the randomized rateless code. The following steps are then repeated for each 
chunk in round r: 

1) Using common randomness, the encoder and decoder choose t = t(N) positions $T_{r,n}$ and a random partition 
of $T_{r,n}$ into $|\mathcal{X}|$ subsets $T_{r,n}(x)$ of size $t/|\mathcal{X}|$ for training in chunk n. 

2) The encoder transmits the n-th chunk using the encoding map as defined in Equations (23)-(24). In particular, 
the symbol x is sent during the training positions $T_{r,n}(x)$. 

3) The decoder estimates the empirical channel in chunk n and the empirical channel over the round so far: 

$$ \hat{w}_r^{(n)}(y|x) = \frac{|\mathcal{X}|}{t} \cdot \left| \{ j \in T_{r,n}(x) : y_j = y \} \right| \tag{25} $$
$$ \hat{W}_r^{(n)}(y|x) = \frac{1}{n} \sum_{i=1}^{n} \hat{w}_r^{(i)}(y|x) . \tag{26} $$

4) The decoder makes a decision based on $\hat{W}_r^{(n)}$ and n. 
a) If 

$$ I\big(P, \hat{W}_r^{(n)}\big) - \epsilon_1 < \tau , \tag{27} $$

where $\tau > 0$ is a parameter of the algorithm, then the decoder feeds back "BAD NOISE" and the round 
is terminated without decoding the k bits. In the next round, the encoder will attempt to resend the k 
bits from this round. 
b) If 

$$ I\big(P, \hat{W}_r^{(n)}\big) - \epsilon_1 > \frac{k}{(b - t)\, n} , \tag{28} $$

where t is defined in Section III-F.3, then the decoder decodes, feeds back "DECODED," and the encoder 
starts a new round. 
c) Otherwise the decoder feeds back "KEEP GOING" and returns to step 2. 

Thus, we have that 

$$ \phi_i(\mathbf{y}, G) = \begin{cases} \text{"BAD NOISE"} , & I\big(P, \hat{W}_r^{(n)}\big) - \epsilon_1 < \tau ,\ I\big(P, \hat{W}_r^{(n)}\big) - \epsilon_1 < \dfrac{k}{(b - t)\, n} \\ \text{"DECODED"} , & I\big(P, \hat{W}_r^{(n)}\big) - \epsilon_1 > \dfrac{k}{(b - t)\, n} \\ \text{"KEEP GOING"} , & \text{otherwise} \end{cases} \tag{29} $$



This strategy has two main ingredients. First, the encoder uses random training sequences to let the decoder 
accurately estimate the empirical average channel. Given this accurate estimate, the decoder can track the empirical 
mutual information of the channel over the round. Second, the decoder only needs to know that the empirical rate 
is smaller than the empirical mutual information in order to guarantee a small error probability. 
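
The decoder-side steps (training-based estimation, averaging over the round, and the three-way feedback decision) can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, and checking the decoding condition before the "BAD NOISE" condition is an assumed tie-break for the case where both thresholds are crossed.

```python
import math

def estimate_chunk_channel(y, training, out_alphabet):
    """Per-chunk estimate: empirical output frequencies at the training positions of each x.

    y maps a time index to the received symbol; training maps x to its training positions.
    """
    return {x: [sum(1 for j in pos if y[j] == out) / len(pos) for out in out_alphabet]
            for x, pos in training.items()}

def average_estimates(chunk_estimates):
    """Round estimate: average of the per-chunk estimates seen so far."""
    n = len(chunk_estimates)
    return {x: [sum(est[x][k] for est in chunk_estimates) / n for k in range(len(row))]
            for x, row in chunk_estimates[0].items()}

def mutual_information(P, W):
    """I(P, W) in bits; P[x] is the input probability and W[x] the row W(.|x)."""
    n_out = len(next(iter(W.values())))
    p_out = [sum(P[x] * W[x][k] for x in P) for k in range(n_out)]
    return sum(P[x] * W[x][k] * math.log2(W[x][k] / p_out[k])
               for x in P for k in range(n_out) if P[x] > 0 and W[x][k] > 0)

def feedback(P, W_hat, n, k, b, t, eps1, tau):
    """Feedback message after the n-th chunk of a round."""
    margin = mutual_information(P, W_hat) - eps1
    rate = k / ((b - t) * n)          # empirical rate after n chunks
    if margin > rate:
        return "DECODED"
    if margin < tau:
        return "BAD NOISE"
    return "KEEP GOING"

P = {0: 0.5, 1: 0.5}
clean = {0: [1.0, 0.0], 1: [0.0, 1.0]}      # hypothetical noiseless estimate
useless = {0: [0.5, 0.5], 1: [0.5, 0.5]}    # hypothetical estimate carrying no information
for n in (1, 2, 3):
    print(n, feedback(P, clean, n, k=100, b=50, t=10, eps1=0.1, tau=0.2))
print(feedback(P, useless, 1, k=100, b=50, t=10, eps1=0.1, tau=0.2))
```

As more chunks arrive the empirical rate k/((b - t)n) decreases, so a round over a good channel moves from "KEEP GOING" to "DECODED", while an estimate with little mutual information triggers "BAD NOISE".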

We note again that the channel model and problem formulation involve a fixed overall blocklength N and other 
parameters of the coding strategy are defined in terms of this parameter. However, in practice it may be more 
desirable to fix a number of bits k(N) to send per round and then define the coding parameters in terms of k. 
We have chosen the former method because it is convenient for our mathematical analysis, but we believe that in 
principle the problem could be formulated in an "infinite-horizon" manner as well. 



IV. Analysis 

Showing that the strategy proposed in the previous section satisfies the conditions of Theorem Q] requires some 
more notation. For each round r, let the random variable M(r) be the number of chunks in that round: 

M(r) = inf \l (P, lV r W) - ei < t or - - < / (p, W^ n A - eA . (30) 
n>o I \ J [b — t)n V / J 

Let U ritr denote the time indices in the n-th chunk of round r that are not in the training set T n ^ r . 

The scheme depends on a number of parameters: the overall blocklength $N$, the number of bits per round
$k(N)$, the chunk size $b(N)$, the number of training positions per chunk $t(N)$, the rate gap $\epsilon_1(N)$, the error bound
$\epsilon$, and the feedback rate $\lambda(N)$. In order to make the proof of the result clear, assume that there exist real constants
$g_1, g_2, g_3 \in (0, \tfrac{1}{2})$ with $g_1 > g_2 > g_3$ and set

\[ k(N) = \Theta(N^{2 g_1}), \qquad b(N) = \Theta(N^{g_2}), \qquad t(N) = \Theta(N^{g_3}) . \tag{31} \]

In particular, this means that the ratios $k(N)/N \to 0$, $(b(N))^2/k(N) \to 0$, and $t(N)/b(N) \to 0$.
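
To see the three limits numerically, the following sketch picks concrete exponents and evaluates the ratios for two blocklengths. The particular exponents (0.45, 0.3, 0.1), the unit constants inside the polynomial scalings, and the helper names `scheme_parameters` and `ratios` are illustrative assumptions, not values from the paper.

```python
import math

def scheme_parameters(N, g1=0.45, g2=0.3, g3=0.1):
    """Pick k, b, t with polynomial scalings N^(2*g1), N^g2, N^g3 (constants set to 1)."""
    assert 0.5 > g1 > g2 > g3 > 0
    k = math.ceil(N ** (2 * g1))   # bits per round
    b = math.ceil(N ** g2)         # chunk size
    t = math.ceil(N ** g3)         # training positions per chunk
    return k, b, t

def ratios(N):
    """Return (k/N, b^2/k, t/b); all three should shrink as N grows."""
    k, b, t = scheme_parameters(N)
    return k / N, b * b / k, t / b

small = ratios(10 ** 6)
large = ratios(10 ** 12)
print(small)
print(large)
```

Each entry of the second tuple is smaller than the corresponding entry of the first, in line with the three ratios tending to zero.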



A. Error events 

The scheme requires that the channel estimates $\hat{W}_r^{(M)}$ given in (26) be "close" to the channel averaged over
the non-training positions $U_{n,r}$ (defined after (30) above), and to the channel averaged over the entire round. The
former guarantees that the estimates provided by training are close enough to guarantee that the rateless code is
decodable, and the latter guarantees the gap between the rates achieved by the scheme and the empirical mutual
information is small.

Fig. 3. Curve of the empirical rate illustrating the bounds on M. The upper bound M* is given by (34). (Plot labels: empirical mutual information $I(P, W)$, empirical rate, threshold $\tau + \epsilon_1$.)

A channel estimation error $E_1(r)$ occurs for round $r$ if

\[
\left| I\left(P, \hat{W}_r^{(M(r))}\right) - I\left(P, \frac{1}{M(r)\,(b-t)} \sum_{n=1}^{M(r)} \sum_{i \in U_{n,r}} W(y|x, z_i)\right) \right| > \frac{\epsilon_1}{2} \tag{32}
\]

or

\[
\left| I\left(P, \hat{W}_r^{(M(r))}\right) - I\left(P, \frac{1}{M(r)\, b} \sum_{n=1}^{M(r)} \sum_{i \in U_{n,r} \cup T_{n,r}} W(y|x, z_i)\right) \right| > \frac{\epsilon_1}{2} . \tag{33}
\]

A decoding error $E_2(r)$ happens in round $r$ if the rateless code selected by the encoder and decoder experiences
an error.



B. Preliminaries: Bounding the length of a round 

Before proceeding to analyze the error events, we will provide bounds on the length of a round. Our reasons for
establishing these are two-fold. First, if a round fails to terminate or does not result in successful decoding, the
round length should be sufficiently small so that its impact on the overall rate is small. Second, when taking
union bounds over chunks in a round, the round length should be small enough to guarantee the corresponding error
probabilities are small. Moreover, it helps set the maximum length for the randomized rateless code, defined on
page 7. Lemma 1 provides bounds on $M(r)$, the number of chunks in round $r$, which can be expressed equivalently
as $\ell_r / b(N)$, where $\ell_r$ is defined in (18). For simplicity, we will use $M$ to denote $M(r)$ when the round $r$ is clear
from context.

Lemma 1 (Bounds on M): Fix $\epsilon_1 > 0$ and $\tau > 0$. Then for the scheme described in Section III-F, the stopping
time $M$ satisfies $M < M^*$, where

\[ M^* = \frac{k(N)}{(b(N) - t(N)) \cdot \tau} . \tag{34} \]

If the decoder attempted to decode, then $M > M_*$, where

\[ M_* = \frac{k(N)}{(b(N) - t(N)) \cdot C_{\max}} . \tag{35} \]

Proof: The argument is illustrated in Figure 3. The empirical rate given by (28) is shown in the curve. The
empirical rate $\frac{k}{(b-t) M}$ decreases monotonically with $M$. In order for the algorithm to continue at time $M$, from
(29) we must have

\[ \frac{k}{(b-t)\,M} > I\left(P, \hat{W}_r^{(M)}\right) - \epsilon_1 > \tau . \]

Rearranging shows that $M$ must be less than $M^*$ in (34). The lower bound is trivial from the definition in (35)
and the cardinality bound on mutual information. ■
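
As a quick numerical companion to Lemma 1, the sketch below evaluates the two bounds for given parameters. It assumes $C_{\max} = \log \min\{|\mathcal{X}|, |\mathcal{Y}|\}$ measured in bits (the definition quoted in Section V); the function name `round_length_bounds` and the sample values are hypothetical.

```python
import math

def round_length_bounds(k, b, t, tau, x_size, y_size):
    """Return (M_lower, M_upper), the chunk-count bounds of Lemma 1.

    Assumes rates are in bits, so C_max = log2(min(|X|, |Y|)).
    """
    c_max = math.log2(min(x_size, y_size))
    m_upper = k / ((b - t) * tau)      # round must stop before this many chunks
    m_lower = k / ((b - t) * c_max)    # decoding cannot happen before this many chunks
    return m_lower, m_upper

m_lower, m_upper = round_length_bounds(k=1000, b=50, t=10, tau=0.1, x_size=2, y_size=2)
print(m_lower, m_upper)
```

The ratio of the two bounds is $C_{\max}/\tau$, so a smaller threshold $\tau$ lets bad rounds run proportionally longer before they are abandoned.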
C. Channel estimation for a single round

In this section, we provide an upper bound on the error event $E_1(r)$. The argument relies on the following
observation: if sufficiently many samples are collected to estimate the channel, these estimates converge to the
overall average channel. Lemmas 2 and 3 make this precise. That is, with a modest number of randomly chosen
training symbols, the decoder can estimate the empirical mutual information of the channel such that the probability
of the channel estimation error event $E_1(r)$ is small.

Lemma 2 (Simple channel estimation): Recall the chunk training estimates defined in (25), and let the parameters
satisfy the conditions in (31). Then for any $\epsilon_4 > 0$ there exists an $N$ sufficiently large and a constant $\alpha_1$ such that
for the $j$-th chunk the training estimates satisfy:

\[
\mathbb{P}\left( \left| \hat{w}_r^{(j)}(y|x) - W_{\mathbf{z}(U_{r,j} \cup T_{r,j})}(y|x) \right| > \epsilon_4 \ \ \forall\, x, y \right) \le \exp\left(-\alpha_1 \epsilon_4^2 t\right) \tag{36}
\]
\[
\mathbb{P}\left( \left| \hat{w}_r^{(j)}(y|x) - W_{\mathbf{z}(U_{r,j})}(y|x) \right| > \epsilon_4 \ \ \forall\, x, y \right) \le \exp\left(-\alpha_1 \epsilon_4^2 t\right) \tag{37}
\]

where $t$ is the size of the training set $T_{r,j}$.

Proof: Proving the claim requires two applications of Hoeffding's inequality [25] to the training data. The
first uses the sampling with replacement version of the inequality to show that the training estimates are close to
the state-averaged channel at those training positions. The second uses the sampling without replacement version to
show that the state-averaged channel in the training positions is close to the state-averaged channel over the entire
chunk. An application of the triangle inequality and our parameter assumptions in (31) complete the argument.

We now make this precise. First consider the random variables $\{\mathbf{1}(y_i = y) : i \in T_{r,j}(x)\}$ for each $x$ and $y$. Their
expectations over the channel are $\{W(y|x, z_i) : i \in T_{r,j}(x)\}$. Applying Hoeffding's inequality to these variables
shows that their mean, which is $\hat{w}_r^{(j)}(y|x)$, is close to $W_{\mathbf{z}(T_{r,j}(x))}(y|x)$, the average channel during the training:

\[
\mathbb{P}\left( \left| \hat{w}_r^{(j)}(y|x) - W_{\mathbf{z}(T_{r,j}(x))}(y|x) \right| > \epsilon_5 \right) \le 2 \exp\left( -2 \frac{t}{|\mathcal{X}|} \epsilon_5^2 \right) . \tag{38}
\]

Now, recall that the training positions $T_{r,j}$, defined on page 7, are sampled uniformly without replacement from
the whole chunk, so the average channel $W_{\mathbf{z}(T_{r,j}(x))}(y|x)$ is itself a random variable formed by averaging the random
variables $\{W(y|x, z_i) : i \in T_{r,j}(x)\}$. The mean of each of these variables is $W_{\mathbf{z}(U_{r,j} \cup T_{r,j})}$, the state-averaged channel
over the whole chunk. For sampling without replacement, another result of Hoeffding [25, Theorem 4] states that
the same exponential inequalities for sampling with replacement hold, so the channel during the training is a good
approximation to the entire channel during the chunk:

\[
\mathbb{P}\left( \left| W_{\mathbf{z}(T_{r,j}(x))} - W_{\mathbf{z}(U_{r,j} \cup T_{r,j})} \right| > \epsilon_5 \right) \le 2 \exp\left( -2 \frac{t}{|\mathcal{X}|} \epsilon_5^2 \right) . \tag{39}
\]

By applying the triangle inequality to equations (38) and (39), we have the following:

\[
\mathbb{P}\left( \left| \hat{w}_r^{(j)}(y|x) - W_{\mathbf{z}(U_{r,j} \cup T_{r,j})}(y|x) \right| > 2\epsilon_5 \right) \le 4 \exp\left( -2 \frac{t}{|\mathcal{X}|} \epsilon_5^2 \right) . \tag{40}
\]

Finally, observe the following:

\[
\left| W_{\mathbf{z}(U_{r,j} \cup T_{r,j})} - W_{\mathbf{z}(U_{r,j})} \right|
= \left| \frac{1}{b} \sum_{i \in U_{r,j} \cup T_{r,j}} W(y|x, z_i) - \frac{1}{b-t} \sum_{i \in U_{r,j}} W(y|x, z_i) \right| \tag{41}
\]
\[
\le \frac{1}{b} \sum_{i \in T_{r,j}} W(y|x, z_i) + \frac{t}{b(b-t)} \sum_{i \in U_{r,j}} W(y|x, z_i) \tag{42}
\]
\[
\le \frac{2t}{b} . \tag{43}
\]

The assumptions in (31) imply that (43) can be made small for sufficiently large $N$. Thus for $N$ sufficiently large,
another application of the triangle inequality to (40) and (43) gives the following:

\[
\mathbb{P}\left( \left| \hat{w}_r^{(j)}(y|x) - W_{\mathbf{z}(U_{r,j})}(y|x) \right| > 3\epsilon_5 \right) \le 4 \exp\left( -2 \frac{t}{|\mathcal{X}|} \epsilon_5^2 \right) . \tag{44}
\]

Choosing $\epsilon_4 = 3\epsilon_5$ and applying a union bound over all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ we get

\[
\mathbb{P}\left( \left| \hat{w}_r^{(j)}(y|x) - W_{\mathbf{z}(U_{r,j})}(y|x) \right| > \epsilon_4 \ \ \forall\, x, y \right)
\le \exp\left( -\frac{2}{9} \frac{t}{|\mathcal{X}|} \epsilon_4^2 + \log |\mathcal{X}||\mathcal{Y}| + \log 4 \right) \tag{45}
\]
\[
\le \exp\left( -\alpha_1 \epsilon_4^2 t \right) , \tag{46}
\]

where the last inequality follows from taking $N$ sufficiently large and the fact that $t(N)$ increases with $N$. ■
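
The estimation step can be illustrated with a small simulation. This is a simplified sketch rather than the construction in the proof: it assumes a binary channel whose flip probability depends on an arbitrary state pattern, sends a single known training symbol (so the factor $t/|\mathcal{X}|$ becomes $t$), and uses hypothetical names (`hoeffding_bound`, `estimate_flip_probability`) and a fixed seed.

```python
import math
import random

def hoeffding_bound(t, eps):
    """Two-sided Hoeffding bound 2*exp(-2*t*eps^2) for t samples in [0, 1]."""
    return 2.0 * math.exp(-2.0 * t * eps * eps)

def estimate_flip_probability(flip_probs, t, rng):
    """Estimate the chunk-averaged flip probability from t training positions.

    flip_probs[i] is the state-dependent probability (unknown to the decoder)
    that position i flips the known training symbol. Training positions are
    drawn uniformly without replacement from the chunk.
    """
    training = rng.sample(range(len(flip_probs)), t)
    flips = sum(1 for i in training if rng.random() < flip_probs[i])
    return flips / t

rng = random.Random(0)
b, t = 10_000, 2_000
flip_probs = [0.1 if i % 2 == 0 else 0.3 for i in range(b)]  # arbitrary state pattern
chunk_average = sum(flip_probs) / b
estimate = estimate_flip_probability(flip_probs, t, rng)
print(chunk_average, estimate, hoeffding_bound(t, 0.05))
```

With these sizes the Hoeffding bound for a deviation of 0.05 is already below $10^{-3}$, which is the sense in which a modest number of training symbols suffices.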
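Lemma 3 (Channel estimation): Recall the error event $E_1(r)$ defined on page 9, and let the parameters satisfy
the conditions in (31). Then for any $\epsilon_1 > 0$ there exists $N$ sufficiently large and an $\alpha_2 > 0$ such that for any round
$r$ and any state sequence $\mathbf{z} \in \mathcal{Z}^{M(r) b}$,

\[
\mathbb{P}\left( \left| I\left(P, \hat{W}_r^{(M(r))}\right) - I\left(P, \frac{1}{M(r)\,(b-t)} \sum_{n=1}^{M(r)} \sum_{i \in U_{n,r}} W(y|x, z_i)\right) \right| > \frac{\epsilon_1}{2} \right) \le \exp\left(-\alpha_2 t\right) \tag{47}
\]
\[
\mathbb{P}\left( \left| I\left(P, \hat{W}_r^{(M(r))}\right) - I\left(P, \frac{1}{M(r)\, b} \sum_{n=1}^{M(r)} \sum_{i \in U_{n,r} \cup T_{n,r}} W(y|x, z_i)\right) \right| > \frac{\epsilon_1}{2} \right) \le \exp\left(-\alpha_2 t\right) . \tag{48}
\]

Therefore $\mathbb{P}(E_1(r)) \le 2\exp(-\alpha_2 t)$.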

Proof: For all $(x,y)$, Lemma 2 guarantees that for any $\epsilon_4 > 0$ the channel estimated during the training of
any chunk is within $\epsilon_4$ of the average channel during the whole chunk and during the codeword positions, except with
probability at most $\exp(-\alpha_1 \epsilon_4^2 t)$. For a round of length $M(r)$, a union bound over chunks shows that

\[
\mathbb{P}\left( \left| \frac{1}{M(r)} \sum_{j=1}^{M(r)} \hat{w}_r^{(j)}(y|x) - \frac{1}{M(r)} \sum_{j=1}^{M(r)} W_{\mathbf{z}(U_{r,j} \cup T_{r,j})}(y|x) \right| > \epsilon_4 \ \ \forall\, x, y \right) \le M(r) \exp\left(-\alpha_1 \epsilon_4^2 t\right) \tag{49}
\]
\[
\mathbb{P}\left( \left| \frac{1}{M(r)} \sum_{j=1}^{M(r)} \hat{w}_r^{(j)}(y|x) - \frac{1}{M(r)} \sum_{j=1}^{M(r)} W_{\mathbf{z}(U_{r,j})}(y|x) \right| > \epsilon_4 \ \ \forall\, x, y \right) \le M(r) \exp\left(-\alpha_1 \epsilon_4^2 t\right) . \tag{50}
\]

Since $M(r)$ is at most $M^*$, for $N$ sufficiently large the effect of the union bound is negligible.

The remainder of the proof is to show that if the channel estimated from the training is close with high probability
to both the average channel during the codeword positions and the average channel during the whole round, then
the empirical mutual informations must be close as well. Lemma 7 in the Appendix shows exactly this. For any
$\epsilon_1 > 0$ there exists an $\epsilon_4 > 0$ and $N$ sufficiently large such that if the events in (49) and (50) fail to hold then the events
in (48) and (47) also fail to hold. This completes the proof. ■

Remark: Under the parameter assumptions in equation (31), the number of bits of common randomness needed
in Lemmas 2 and 3 to specify the training positions is sublinear in the blocklength $N$. Note that a similar conclusion
was reached by Shayevitz and Feder for their scheme, which also uses training positions to estimate the channel
[3]. This point is discussed in more detail in Section V on page 17.



D. Rateless coding 

The last ingredient in our strategy is the rateless code used during each round. The key property we need is that 
if the empirical rate drops below the empirical mutual information of the channel, then the code can be decoded 
with small probability of error. 

Lemma 4 (Rateless codes): For any $\delta' > 0$ and distribution $P$, there exists an integer $c$ sufficiently large, $\epsilon_8 > 0$,
and an $(M^*, c, k)$ randomized rateless code defined in Section III-F such that if at decoding time $M$ the state
sequence $\mathbf{z}_1^{Mc}$ satisfies

\[ \frac{k}{Mc} \le I\left(P, W_{\mathbf{z}_1^{Mc}}\right) - \delta' , \tag{51} \]

then its maximal error $\hat{\epsilon}(M, \mathbf{z})$, defined in (21), satisfies

\[ \hat{\epsilon}(M, \mathbf{z}) \le \exp(-Mc\,\epsilon_8) . \tag{52} \]

Proof: Fix $\delta'$ and a distribution $P$. We can approximate $P$ arbitrarily closely with a type of a sufficiently
large denominator, so without loss of generality, we assume $P$ is a type and choose $c$ to be large enough so that
the denominator of type $P$ divides $c$. Let $\mathcal{C}_M(J)$ be a randomized rateless code. Specifically, $\mathcal{C}_M(J)$ is a random
variable distributed on the set of rateless codes of blocklength $Mc$ whose $J$ codewords are drawn independently
and uniformly from the composition-$P$ set $T_{Mc}(P)$ and with a maximum mutual information (MMI) decoder.
The remainder of the proof can be sketched as follows: we verify that the codebook $\mathcal{C}_M(J)$ has satisfactory error
performance under the assumptions of this Lemma. Then, we construct a codebook $\mathcal{D}_M(K)$ by keeping only those
codewords in $\mathcal{C}_M(J)$ whose composition is $P$ in each chunk of $c$ symbols. We then show that the distribution of
$\mathcal{D}_M(K)$ is the same as that of a codebook $\mathcal{E}_{M^*}(K)$ truncated to blocklength $Mc$.

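Codebook properties. Before proceeding to construct $\mathcal{D}_M(K)$, we first examine properties of the constant-
composition codebook $\mathcal{C}_M(J)$ of composition $P$. Recall the definition of maximal error for randomized rateless
codes in (21) and (22). A result of Hughes and Thomas [26, Theorem 1] shows that for sufficiently large $Mc$, there
exists a function $E_r$ such that for all $J > 0$, $\delta > 0$, and distribution $Q$ on $\mathcal{Z}$,

\[
\max_{\mathbf{z} \in T_{Mc}(Q)} \max_{j \in [J]} \epsilon_j(M, \mathbf{z}, \mathcal{C}_M(J)) \le \exp\left( -Mc \left[ E_r\left((Mc)^{-1} \log J + \delta, W, P, Q\right) - \delta \right] \right) \tag{53}
\]
\[
E_r\left((Mc)^{-1} \log J + \delta, W, P, Q\right) \ge \max\left\{ 0, \ I(P, W_Q) - \delta - \frac{1}{Mc} \log J \right\} . \tag{54}
\]

Fix $\epsilon_7 = \delta$ and let $\mathcal{Q}(M)$ be the set of all $Q$ such that

\[ 0 < \epsilon_7 \le I(P, W_Q) - 2\delta - \frac{1}{Mc} \log J . \tag{55} \]

If $Q \in \mathcal{Q}(M)$, then we can rewrite the bound in (53) as follows:

\[
\max_{\mathbf{z} \in T_{Mc}(Q) : Q \in \mathcal{Q}(M)} \ \max_{j \in [J]} \epsilon_j(M, \mathbf{z}, \mathcal{C}_M(J)) \le \exp(-Mc\,\epsilon_7) . \tag{56}
\]

In particular, this gives the following bound on the expectation over $\mathcal{C}_M(J)$ of the average error:

\[
\max_{\mathbf{z} \in T_{Mc}(Q) : Q \in \mathcal{Q}(M)} \mathbb{E}_{\mathcal{C}_M(J)} \left[ \frac{1}{J} \sum_{j=1}^{J} \epsilon_j(M, \mathbf{z}, \mathcal{C}_M(J)) \right] \le \exp(-Mc\,\epsilon_7) . \tag{57}
\]

Use Markov's inequality to bound the probability that the average error exceeds a given value $\alpha_1$:

\[
\max_{\mathbf{z} \in T_{Mc}(Q) : Q \in \mathcal{Q}(M)} \mathbb{P}_{\mathcal{C}_M(J)} \left( \frac{1}{J} \sum_{j=1}^{J} \epsilon_j(M, \mathbf{z}, \mathcal{C}_M(J)) \ge \alpha_1(c, M) \right) \le \frac{\exp(-Mc\,\epsilon_7)}{\alpha_1(c, M)} . \tag{58}
\]

This establishes that for any $\delta > 0$ the codebook has average error no more than $\alpha_1(c, M)$ with high probability.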

Expurgation. We define a thinning operation on the codebook $\mathcal{C}_M(J)$ to form the codebook $\mathcal{D}_M(K)$ as follows:
remove all codewords in $\mathcal{C}_M(J)$ which are not in the piecewise constant-composition set $\{T_c(P)\}^M$. That is, we
keep only those codewords which have type $P$ in each chunk. If there are fewer than $K$ remaining codewords
after this expurgation, declare an encoding error; if there are more than $K$ then keep the first $K$ codewords. The
decoding rule is the same MMI rule as before.

The probability of this encoding error can be bounded using Lemma 8 on page 20, which states that the probability
that a codeword drawn uniformly from $T_{Mc}(P)$ is also in the set $\{T_c(P)\}^M$ is at least $\beta_0(c, M) = \exp(-\eta M \log(c+1))$
for $c$ sufficiently large. Therefore the expected number of codewords in $\mathcal{C}_M(J)$ that survive the thinning is at
least $J \exp(-\eta M \log c)$. Since the codewords are i.i.d., the probability that the number of codewords surviving the
thinning is less than $\beta J$ can be bounded:

\[
\mathbb{P}\left( \left| \mathcal{C}_M(J) \cap \{T_c(P)\}^M \right| < \beta J \right) \le J \cdot \exp\left( -J \cdot D(\beta \,\|\, \beta_0(c, M)) \right) . \tag{59}
\]

By choosing $K = \beta_0(c, M)^2 J$, which corresponds to $\beta = \beta_0(c, M)^2$, the probability of encoder error can be made
arbitrarily small. The rate of codebook $\mathcal{D}_M(K)$ is

\[ \frac{1}{Mc} \log K = \frac{1}{Mc} \log J - \frac{2 \eta \log c}{c} . \tag{60} \]

Setting $k = \log K$, note from (55), for sufficiently large $c$ the error can be made small as long as

\[ \frac{k}{Mc} \le I(P, W_Q) - 3\delta - \frac{2 \eta \log c}{c} . \tag{61} \]

Setting $\delta = \delta'/4$ in the original construction of $\mathcal{C}_M(J)$, for sufficiently large $c$, equation (61) guarantees a bound
on the error. In particular, since the codewords of $\mathcal{D}_M(K)$ are a subset of the codewords of $\mathcal{C}_M(J)$, the average
error can increase at most by a factor of $J/K$:

\[
\max_{\mathbf{z} \in T_{Mc}(Q) : Q \in \mathcal{Q}(M)} \mathbb{P}_{\mathcal{C}_M(J)} \left( \frac{1}{K} \sum_{j=1}^{K} \epsilon_j(M, \mathbf{z}, \mathcal{D}_M(K)) \ge \frac{\alpha_1(c, M)}{\beta_0(c, M)^2} \right) \le \frac{\exp(-Mc\,\epsilon_7)}{\alpha_1(c, M)} . \tag{62}
\]

This shows that for any $\delta' > 0$ the average error can be bounded.

Nesting. Consider the codebook $\mathcal{E}_M(K)$ formed by drawing $K$ codewords independently and uniformly distributed
on $\{T_c(P)\}^M$ together with the MMI decoding rule. It is clear that $\mathcal{D}_M(K)$ has the same distribution as $\mathcal{E}_M(K)$,
so the bound (62) holds for $\mathcal{E}_M(K)$ as well:

\[
\max_{\mathbf{z} \in T_{Mc}(Q) : Q \in \mathcal{Q}(M)} \mathbb{P}_{\mathcal{E}_M(K)} \left( \frac{1}{K} \sum_{j=1}^{K} \epsilon_j(M, \mathbf{z}, \mathcal{E}_M(K)) \ge \frac{\alpha_1(c, M)}{\beta_0(c, M)^2} \right) \le \frac{\exp(-Mc\,\epsilon_7)}{\alpha_1(c, M)} . \tag{63}
\]

Note that $\mathcal{E}_M(K)$ has the same distribution as the codebook $\mathcal{E}_{M^*}(K)$ truncated to blocklength $Mc$. The set of
$\mathbf{z} \in \mathcal{Z}^{M^* c}$ for which the bounds (63) hold is

\[
\mathcal{Z}(K) = \left\{ \mathbf{z} \in \mathcal{Z}^{M^* c} : (z_1, \ldots, z_{Mc}) \in T_{Mc}(Q), \ Q \in \mathcal{Q}(M), \ M \in \{M_*, \ldots, M^*\} \right\} . \tag{64}
\]

For any $\mathbf{z}$ in this set and decoding time $M$ such that $(z_1, \ldots, z_{Mc}) \in T_{Mc}(Q)$ for some $Q \in \mathcal{Q}(M)$, the probability
that the random codebook $\mathcal{E}_{M^*}(K)$ truncated to blocklength $Mc$ has average error probability exceeding
$\alpha_1(c, M) / \beta_0(c, M)^2$ can be made arbitrarily small.

Back to maximal error. Equation (63) says that the average error under the randomized code $\mathcal{E}_M(K)$ can
be made arbitrarily small. Standard results on AVCs [24, Exercise 2.6.5] show that by permuting the message
index the same bound holds for the maximal error. Thus with probability $1 - \exp(-Mc\,\epsilon_7)/\alpha_1(c, M)$ the randomly
selected codebook has maximal error smaller than $\alpha_1(c, M)/\beta_0(c, M)^2$. The probability of encoding error is vanishingly small
with respect to these quantities, so the total probability of error can be upper bounded:

\[
\hat{\epsilon}(M, \mathbf{z}) \le \max\left\{ \frac{\exp(-Mc\,\epsilon_7)}{\alpha_1(c, M)}, \ \frac{\alpha_1(c, M)}{\beta_0(c, M)^2} \right\} \tag{65}
\]
\[
\le \max\left\{ \frac{\exp(-Mc\,\epsilon_7)}{\alpha_1(c, M)}, \ \frac{\alpha_1(c, M)}{\exp(-2 \eta M \log c)} \right\} . \tag{66}
\]

Selecting $\alpha_1(c, M) = \exp(-Mc\,\epsilon_7/2)$ yields the following bound for sufficiently large $c$:

\[ \hat{\epsilon}(M, \mathbf{z}) \le \exp(-Mc\,\epsilon_7/3) . \tag{67} \]

Setting $\epsilon_8 = \epsilon_7/3$ yields the result. ■
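
The last step, choosing $c$ large enough that both terms of the maximum decay at least as fast as $\exp(-Mc\,\epsilon_7/3)$, can be checked numerically. The sketch below works with logarithms to avoid underflow and uses the fact that $\log(c)/c$ is decreasing for $c \ge 3$; the function names and the sample values of $\epsilon_7$ and $\eta$ are assumptions chosen only for illustration.

```python
import math

def log_error_terms(M, c, eps7, eta):
    """Natural logs of the two terms in the max, with alpha_1 = exp(-M*c*eps7/2)."""
    log_alpha1 = -M * c * eps7 / 2.0
    log_first = -M * c * eps7 - log_alpha1                   # exp(-Mc*eps7) / alpha_1
    log_second = log_alpha1 + 2.0 * eta * M * math.log(c)    # alpha_1 / exp(-2*eta*M*log c)
    return log_first, log_second

def smallest_c(eps7, eta, c_start=3):
    """Smallest c >= c_start with 2*eta*log(c)/c <= eps7/6.

    For c >= 3 the ratio log(c)/c is decreasing, so every larger c also works.
    """
    c = c_start
    while 2.0 * eta * math.log(c) / c > eps7 / 6.0:
        c += 1
    return c

eps7, eta, M = 0.1, 1.0, 5
c = smallest_c(eps7, eta)
log_first, log_second = log_error_terms(M, c, eps7, eta)
target = -M * c * eps7 / 3.0
print(c, log_first, log_second, target)
```

For these sample values a chunk length of several hundred symbols is needed before the polynomial penalty from expurgation is absorbed by the exponential decay, which illustrates why the lemma only asserts existence for sufficiently large $c$.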
Remark: As stated, the codebook constructed in Lemma 4 requires a very large amount of common randomness
shared between the encoder and decoder. This issue is discussed in more detail in Section V on page 17.



E. Proof of Theorem 1

We now combine the results in the previous sections to prove Theorem 1. Namely, in Section IV-A we defined
error events $E_1(r)$ and $E_2(r)$. We then provided bounds on $E_1(r)$ in Lemma 3 and proved the existence of a
randomized rateless code with a small maximal error probability in Lemma 4. As will be seen in the proof,
Lemmas 3 and 4 provide a bound on $E_2(r)$. By combining this bound with the bound on $E_1(r)$ and parameter
assumptions in (31), the result follows straightforwardly.

Proof: The proof is divided into three parts. We first establish in equation (68) that for sufficiently large $N$,
the feedback rate can be made arbitrarily small. In the second part, we bound the error probability in (77). In the
third part, we give a lower bound on the rate under the assumption the error event does not occur, which leads to
equation (90). These parts establish all necessary components in the statement of the result.

We use the coding strategy proposed in Section III-F. Note that under the parameter assumptions in (31), for all
$\lambda^* > 0$, there exists sufficiently large $N$ such that the feedback rate satisfies the following bound:

\[ R_{\mathrm{fb}} \le \lambda^* . \tag{68} \]

Fix a sequence $\mathbf{z}$. The scheme induces a partition of $\mathbf{z}$ into rounds $r = 1, 2, \ldots$ at times $\{\ell_r\}$. Let $\mathbf{z}(r)$
be the state sequence during the $r$-th round. The type of $\mathbf{z}$ can be written as:

\[ T_{\mathbf{z}} = \sum_r \frac{\ell_r}{N} T_{\mathbf{z}(r)} , \tag{69} \]

where $\ell_r$ is the length of a round, as defined in equation (18). Lemma 3 shows that for any $\epsilon_1 > 0$ there exists an
$N$ sufficiently large such that the channel estimation error probability $\mathbb{P}(E_1(r))$ is exponentially small. Taking a
union bound over all rounds, the probability of estimation error is

\[ \mathbb{P}\left( \bigcup_r E_1(r) \right) \le 2 \frac{N}{b} \exp(-\alpha_2 t) . \tag{70} \]

By the parameter assumptions in (31), $N/b$ and $t$ grow polynomially in $N$, so for large $N$ the exponential term
dominates and the probability of an estimation error in any round goes to 0. Given any $\epsilon > 0$, for sufficiently large
$N$, equation (70) gives the following bound:

\[ \mathbb{P}\left( \bigcup_r E_1(r) \right) \le \frac{\epsilon}{2} . \tag{71} \]

Suppose round $r$ was terminated due to "BAD NOISE." In this case, from (27) we have the following:

\[ I\left(P, \hat{W}_r^{(M(r))}\right) - \epsilon_1 < \tau . \tag{72} \]

By Lemma 3, $I(P, \hat{W}_r^{(M(r))})$ is close to $I(P, W_{\mathbf{z}(r)})$. That is, there exists an $N$ sufficiently large such that with
probability $1 - \exp(-\alpha_2 t)$, we have that $I(P, W_{\mathbf{z}(r)}) < \tau + 3\epsilon_1/2$. For any $\rho > 0$, we can choose a large $N$ and
small $\tau$ such that the following holds for all "BAD NOISE" rounds:

\[ I\left(P, W_{\mathbf{z}(r)}\right) < \rho/2 . \tag{73} \]

Therefore, for rounds which are terminated due to bad noise, the state sequence $\mathbf{z}(r)$ has a type $T_{\mathbf{z}(r)}$ such that
$I(P, W_{\mathbf{z}(r)})$ is small.

Now suppose the decoder attempted to decode at the end of round $r$. Then (28) implies that the estimated
empirical mutual information from the training satisfies a different inequality:

\[ \frac{k}{(b-t)\,M(r)} < I\left(P, \hat{W}_r^{(M(r))}\right) - \epsilon_1 . \tag{74} \]

If the event $E_1(r)$ does not happen, then $I(P, \hat{W}_r^{(M(r))})$ is within $\epsilon_1/2$ of the empirical mutual information during
the non-training positions:

\[ \frac{k}{(b-t)\,M(r)} < I\left(P, \frac{1}{M(r)\,(b-t)} \sum_{n=1}^{M(r)} \sum_{i \in U_{n,r}} W(y|x, z_i)\right) - \frac{\epsilon_1}{2} . \tag{75} \]

Thus, conditioned on $E_1^c(r)$ and under our assumption (31), (75) and Lemma 4 imply that for $\delta' = \epsilon_1/2$ there
exists a sufficiently large $N$, exponent $\epsilon_8 > 0$, and an $(M^*, b-t, k)$ randomized rateless code with error $\hat{\epsilon}(M, \mathbf{z}) \le
\exp(-M(b-t)\epsilon_8)$ for every round $r$ in which decoding occurs. A union bound then implies the decoding error
probability over all rounds in which decoding occurs can be bounded:

\[ \mathbb{P}\left( \bigcup_r E_2(r) \ \Big| \ \bigcap_r E_1^c(r) \right) \le \frac{N}{b} \exp\left(-(b-t)\,\epsilon_8\right) . \tag{76} \]

By (31), this can be made arbitrarily small for sufficiently large $N$, and therefore for any $\epsilon > 0$, (71) and (76)
imply there exists an $N$ sufficiently large such that the estimation error and decoding error can be made smaller
than $\epsilon$:

\[ \mathbb{P}\left( \bigcup_r \left( E_1(r) \cup E_2(r) \right) \right) \le \epsilon . \tag{77} \]

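The remaining thing is to calculate the rate, given that none of the error events occur. If the decoder attempted
to decode after $M(r)$ chunks, then after $M(r) - 1$ chunks the threshold condition in (28) was not satisfied:

\[ \frac{k}{(b-t) \cdot (M(r) - 1)} \ge I\left(P, \hat{W}_r^{(M(r)-1)}\right) - \epsilon_1 . \tag{78} \]

Our assumption in equation (31) that $(b(N))^2/k(N) \to 0$ and our lower bound on the length of a round in Lemma 1,
which is $\Omega(k(N)/b(N))$ channel uses, imply that for sufficiently large $N$, the amount that the estimated mutual information
can change over the course of a single chunk ($b(N)$ channel uses) can be made arbitrarily small. More formally,
for any $\epsilon_6 > 0$, for sufficiently large $N$,

\[ \left| I\left(P, \hat{W}_r^{(M(r)-1)}\right) - I\left(P, \hat{W}_r^{(M(r))}\right) \right| < \epsilon_6 . \tag{79} \]

Thus

\[ \frac{k}{(b-t) \cdot M(r)} = \left(1 - \frac{1}{M(r)}\right) \frac{k}{(b-t) \cdot (M(r) - 1)} \tag{80} \]
\[ \ge \left(1 - \frac{1}{M(r)}\right) \left( I\left(P, \hat{W}_r^{(M(r)-1)}\right) - \epsilon_1 \right) \tag{81} \]
\[ \ge \left(1 - \frac{1}{M(r)}\right) \left( I\left(P, \hat{W}_r^{(M(r))}\right) - \epsilon_1 - \epsilon_6 \right) . \tag{82} \]

Finally, the overall empirical rate for the round is slightly lower because of overhead from training:

\[ \frac{k}{b\,M(r)} \ge \left(1 - \frac{1}{M(r)}\right) \left(1 - \frac{t}{b}\right) \left( I\left(P, \hat{W}_r^{(M(r))}\right) - \epsilon_1 - \epsilon_6 \right) . \tag{83} \]

Under the assumptions in (31) and conditioned on (33) not occurring, for any $\rho > 0$ there exists an $N$ sufficiently
large such that

\[ \frac{k}{b\,M(r)} \ge I\left(P, W_{\mathbf{z}(r)}\right) - \rho/2 . \tag{84} \]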

The final thing to consider is the last round $r^*$ in which the decoder does not decode. The maximum length of
this round is $M^* b$, and

\[ \frac{\ell_{r^*}}{N} I\left(P, W_{\mathbf{z}(r^*)}\right) \le \frac{M^* b}{N} I\left(P, W_{\mathbf{z}(r^*)}\right) \le \frac{M^* b}{N} \log \max\{|\mathcal{X}|, |\mathcal{Y}|\} . \tag{85} \]

By (31), for sufficiently large $N$, (85) can be made to satisfy the following condition:

\[ \frac{\ell_{r^*}}{N} I\left(P, W_{\mathbf{z}(r^*)}\right) < \rho/2 . \tag{86} \]

To summarize, for sufficiently large $N$ and each round $r$ in which the decoder feeds back "BAD NOISE" or
"DECODED", the rate at which the scheme decodes can be lower bounded by

\[ R(r) \ge I\left(P, W_{\mathbf{z}(r)}\right) - \rho/2 , \tag{87} \]

which follows from (73) and (84). Finally, we use (86), (87), and the convexity of mutual information to provide
a lower bound on the overall rate of the scheme:

\[ R \ge \sum_{r \ne r^*} \frac{\ell_r}{N} \left( I\left(P, W_{\mathbf{z}(r)}\right) - \rho/2 \right) \tag{88} \]
\[ \ge I\left(P, \sum_r \frac{\ell_r}{N} W_{\mathbf{z}(r)}\right) - \rho \tag{89} \]
\[ = I\left(P, W_{\mathbf{z}}\right) - \rho . \tag{90} \]

As mentioned above, the result now follows immediately from (68), (77), and (90). ■
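
The convexity step at the end of the proof says that, for a fixed input distribution, the length-weighted average of the per-round mutual informations is at least the mutual information of the averaged channel. A minimal numerical sketch, assuming two binary symmetric channels of equal weight as stand-ins for two rounds (the helper names and the crossover values 0.1 and 0.3 are illustrative choices):

```python
import math

def mutual_information(p, w):
    """I(P, W) in bits for an input distribution p[x] and a channel w[x][y]."""
    q = [sum(p[x] * w[x][y] for x in range(len(p))) for y in range(len(w[0]))]
    total = 0.0
    for x, px in enumerate(p):
        for y, wxy in enumerate(w[x]):
            if px > 0 and wxy > 0:
                total += px * wxy * math.log2(wxy / q[y])
    return total

def average_channel(channels, weights):
    """Entrywise weighted average of channel matrices (weights sum to 1)."""
    rows, cols = len(channels[0]), len(channels[0][0])
    return [[sum(wt * ch[x][y] for ch, wt in zip(channels, weights))
             for y in range(cols)] for x in range(rows)]

def bsc(q):
    """Binary symmetric channel with crossover probability q."""
    return [[1.0 - q, q], [q, 1.0 - q]]

p = [0.5, 0.5]
rounds = [bsc(0.1), bsc(0.3)]
weights = [0.5, 0.5]
per_round_average = sum(wt * mutual_information(p, ch) for ch, wt in zip(rounds, weights))
of_averaged_channel = mutual_information(p, average_channel(rounds, weights))
print(per_round_average, of_averaged_channel)
```

The gap between the two printed values is the slack that allows the total rate over many rounds to exceed the mutual information of the state-averaged channel.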

V. Discussion 

The central question we tried to address in this paper was how much feedback is needed to achieve the channel
mutual information in the individual sequence setting of [3]. Limited feedback in two-way and relaying systems has
been studied before [27]-[29] and is used in many modern-day communication protocols for control information.
Research interest in limited feedback for multiuser and multiantenna models has grown tremendously (see [30]
and references therein). Quantifying the role and possible benefits of limited feedback is an important step in
understanding how to structure adaptive communication systems.

In this paper we described a coding strategy under a general channel uncertainty model that uses limited feedback 
to achieve rates arbitrarily close to an i.i.d. discrete memoryless channel with the same first-order statistics. Feedback 
allows the system to adapt the coding rate based on the channel conditions. When each element in the class of 
channels over which we are uncertain has the same capacity achieving input distribution, the coding strategy achieves 
rates at least as large as the empirical capacity, which is defined as the capacity of an i.i.d. discrete memoryless 
channel with the same first-order statistics. Since the rates that we can guarantee for our scheme are close to the 
average channel in a round, our total rate over many rounds may in fact exceed the empirical capacity. This is due 
to the convexity of mutual information in the channel. 

The work is a commentary on an earlier investigation by Shayevitz and Feder [3] that considered the case in 
which the encoder has access to full output feedback from the decoder and allows the encoder to provide control and 
estimation information in a set of training sequences that can be selected via common randomness. Furthermore, 
their scheme does not require a fixed blocklength in advance and hence has an infinite horizon. By contrast, our 
strategy can be viewed as a kind of incremental redundancy hybrid ARQ [7], in which the decoder uses the feedback 
link to terminate rounds that are too noisy while less noisy rounds are individually decoded. In order to set the 
parameters for our scheme we must fix a total blocklength in advance, although it may be possible to redefine the 
scheme to operate without a horizon, as in [3]. 

An interesting point is that our basic algorithm uses standard "tricks" for communication systems, such as channel 
estimation via pilot signals, ARQ with rateless codes, and randomization. By adapting or reusing technologies that 
have already been developed, these gains can be realized more easily. Several open questions and extensions of the 
algorithm presented here would be of interest, two of which are the following: 

1) The necessary amount of common randomness. The algorithm presented here requires common randomness 
between the encoder and the decoder to show that zero-rate feedback is sufficient to achieve the empirical 
mutual information. We now provide an account of how much common randomness is required. There are two 
places where our algorithm requires common randomness, namely, (i), the selection of the channel training 
positions, and (ii), the random selection of the codebook for each round. 

For (i), the training positions, under our parameter assumptions in (31), $\log N$ bits are required to indicate
the position of each of the $t = \Theta(N^{g_3})$ training positions for each chunk of length $b = \Theta(N^{g_2})$, where
$\tfrac{1}{2} > g_2 > g_3 > 0$. Since there are $N/b$ chunks, this requires a total of

\[ \frac{N}{b} \cdot t \cdot \log N = \Theta\left( N^{1 - g_2 + g_3} \cdot \log N \right) \text{ bits} , \]

which, under our parameter assumptions, is sublinear in $N$. For (ii), the selection of a codebook for each round
can require as much as $M^* \cdot C_{\max}$ bits of common randomness per codeword for a total of $M^* \cdot C_{\max} \cdot 2^{M^* \cdot C_{\max}}$
bits of common randomness, where $C_{\max} = \log \min\{|\mathcal{X}|, |\mathcal{Y}|\}$. The total number of rounds can be as large
as $N/M_*$, where $M^*$ and $M_*$ are defined in Lemma 1. Thus, codebook selection requires

\[ M^* \cdot C_{\max} \cdot 2^{M^* \cdot C_{\max}} \cdot \frac{N}{M_*} = \frac{(C_{\max})^2}{\tau} \cdot N \cdot 2^{M^* \cdot C_{\max}} \text{ bits} , \]

where $\tau$, defined in (27), is a parameter of the algorithm that does not depend on $N$. Thus, the total common
randomness required is superlinear in $N$.

Reducing common randomness is outside the scope of the current work. However, if common randomness 
were not available between the encoder and decoder, it could be provided by the feedback link, but then the 
strategy considered in this paper would require a prohibitively large feedback rate that would increase with
the blocklength N. To show instead that the feedback rate could be made asymptotically negligible in such a
setting, one would need to prove the existence of a strategy for which the total bits of common randomness 
required would be sublinear in the blocklength N. 

A potential technique that might be useful could be to adapt tools from the theory of arbitrarily varying
channels [31] to find nested code constructions that use a limited amount of common randomness [32]. Such
an argument would require showing that a randomized code with support on $T = (M^* b)^2$ codes can be made
from i.i.d. sampling of the randomized code of Lemma 4. This new randomized code could then be used to
establish a sublinear number of bits. Specifically, in each round, this new randomized code could be used by
selecting one of the $T$ codes for use. This would require $\log T = O(\log N)$ bits per round for a total cost of
at most $O((N/M_*) \log N)$, which would be sublinear in $N$.

Another potential method, more in the interactive coding spirit of feedback systems, could be to show the 
existence of deterministic list-decodable codes with small list sizes. If the list is of size L, the decoder could 
find $L$ bits in the message, which could be used to disambiguate the list [6]. By using $L \log k$ bits in the
feedback, the decoder could request those $L$ bits from the encoder. By sacrificing just $O(L)$ more forward
channel uses, the encoder could send the L bits with negligible impact to the rate. If the empirical mutual 
information in the next round were above r, this would be sufficient for success. 
2) Adaptation of the channel input, and thus, codebook distribution. An apparent limitation of the algorithm 
presented here is that the channel input distribution is selected once and kept fixed throughout, irrespective 
of the behavior of the state sequence. Adaptation of the channel input distribution may lead to higher or 
lower rates. One interesting question would be whether universal prediction techniques [33] can be used in 
conjunction with channel coding to adapt the channel input. Another set of interesting questions emerges if 
we consider performance on a sequence that comes from a certain class of sequences. For example, if one 
were to consider an alternate notion of empirical capacity in which the empirical sequences were estimated 
as finite-order Markov models, adapting the channel input distribution may give quantifiable benefits. 
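
The common randomness accounting in item 1) can be made concrete with a short calculation. The sketch below evaluates the training-position cost $(N/b) \cdot t \cdot \log N$ per channel use and, on a logarithmic scale, the codebook-selection cost $(C_{\max})^2 N\, 2^{M^* C_{\max}} / \tau$; the exponents, the sample value of $M^*$, and the function names are assumptions chosen only for illustration.

```python
import math

def training_bits_per_channel_use(N, g2=0.3, g3=0.1):
    """(N/b) * t * log2(N) / N with b = N^g2 and t = N^g3 (unit constants assumed)."""
    b = N ** g2
    t = N ** g3
    return (N / b) * t * math.log2(N) / N

def log2_codebook_bits(N, m_upper, c_max, tau):
    """log2 of (C_max^2 / tau) * N * 2^(M^* * C_max), kept in the log domain."""
    return math.log2(c_max * c_max * N / tau) + m_upper * c_max

per_use = [training_bits_per_channel_use(N) for N in (10 ** 6, 10 ** 12, 10 ** 24)]
codebook_log2 = log2_codebook_bits(10 ** 6, m_upper=250.0, c_max=1.0, tau=0.1)
print(per_use)
print(codebook_log2, math.log2(10 ** 6))
```

The per-channel-use training cost shrinks with $N$, while the exponent of the codebook cost dwarfs $\log_2 N$, matching the sublinear and superlinear conclusions stated in item 1).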
The individual sequence model considered in this paper is by no means the only way of modeling channel 
uncertainty. One model which does away with modeling the channel state was proposed by Lomnitz and Feder 
[34]. An alternative model within the state sequence framework is a class of noise models that varies in a piecewise- 
constant fashion. This model is related to the on-line estimation problems studied by Kozat and Singer [35] and 
may be useful to understand block fading. For such models we could consider modifying our strategy to adapt 
the value of k by trying to learn the coherence time of the channel. In the sense of competitive optimality, the 
competition class could be coding strategies that know the coherence intervals exactly. Variations on the model of 
the feedback link may also lead to interesting new results. Alternative channel models in which the feedback is 
noisy or allowed to have time-varying rate may present new issues to consider, particularly for the case in which 
there is uncertainty in the feedback link as well. For future communications systems that must share common 
resources, such investigations may shed new light on strategies in these settings. 
Appendix

We provide here the proofs of the lemmas used in the analysis of our algorithm.^3

^3 We were unable to find a standard reference for the entropy bounds below, which is why we provide the derivation. The proofs can be
omitted for space if the reviewers and editors think it appropriate to do so.

A. Bounds on entropy and mutual information

We need a short technical lemma about concave functions.

Lemma 5: Let $f$ be a concave increasing function on $[a, b]$. Then if $a \le x \le x + \epsilon \le b$, we have

\[ f(x + \epsilon) - f(x) \le f(a + \epsilon) - f(a) . \tag{91} \]

Proof: Without loss of generality we can take $a = 0$, $b = 1$, and $f(a) = 0$. Now consider

\[
f(x) = f\left( \frac{x}{x+\epsilon} \cdot (x + \epsilon) + \frac{\epsilon}{x+\epsilon} \cdot 0 \right) \ge \frac{x}{x+\epsilon} f(x + \epsilon) + \frac{\epsilon}{x+\epsilon} f(0) = \frac{x}{x+\epsilon} f(x + \epsilon)
\]
\[
f(\epsilon) = f\left( \frac{x}{x+\epsilon} \cdot 0 + \frac{\epsilon}{x+\epsilon} \cdot (x + \epsilon) \right) \ge \frac{x}{x+\epsilon} f(0) + \frac{\epsilon}{x+\epsilon} f(x + \epsilon) = \frac{\epsilon}{x+\epsilon} f(x + \epsilon) .
\]

Therefore

\[ f(x) + f(\epsilon) \ge f(x + \epsilon) , \tag{92} \]

as desired. ■
Using the preceding lemma, we can show that a bound on the total variational distance between two distributions
gives a bound on the difference between the entropies of those two distributions.

Lemma 6: Let P and Q be two distributions on a finite set S with |<S| > 2. If 

\P(s) - Q(s)\ < e Vse5, (93) 

then 

\H{P) - H(Q)\ < (\S\ - 1) • h b {e) + {\S\ - 1) log(|5| - 1) • e , (94) 

where hb(-) is the binary entropy function. 

Proof: Let S = {s±, S2, ■ ■ ■}■ We proceed by induction on |«S|. Suppose |«S| = 2, and let p = P(s\) and 
q = Q(si). The entropy function hj,(x) is concave, increasing on [0,1/2] and decreasing on [1/2,1]. Applying 
Lemma [5] to each interval, we obtain the bound: 

\h b (x + e) - h b (x)\ < h b (e) . (95) 

Since H(P) = h b (p) and H(Q) = h b (q), this proves our result. 

Now suppose that the lemma holds for $|\mathcal{S}| \le m - 1$, and consider the case $|\mathcal{S}| = m$. Without loss of generality, let $P(s_m) > 0$ and $Q(s_m) > 0$. Let $\lambda = (1 - P(s_m))$ and $\mu = (1 - Q(s_m))$ and note that $|\lambda - \mu| \le \epsilon$ by assumption. Define the $(m - 1)$-dimensional distributions $P' = \lambda^{-1}(P(s_1), \ldots, P(s_{m-1}))$ and $Q' = \mu^{-1}(Q(s_1), \ldots, Q(s_{m-1}))$, so that

$$P = (\lambda P', (1 - \lambda))$$
$$Q = (\mu Q', (1 - \mu)) .$$

Therefore,

$$H(P) = h_b(\lambda) + \lambda H(P')$$
$$H(Q) = h_b(\mu) + \mu H(Q') .$$

Now we can expand the difference of the entropies. Using the fact that $\lambda < 1$, the induction hypothesis on $|H(P') - H(Q')|$ and $|h_b(\lambda) - h_b(\mu)|$, and the cardinality bound on the entropy $H(Q')$, we obtain the result:

$$\begin{aligned}
|H(P) - H(Q)| &= |\lambda H(P') - \mu H(Q') + h_b(\lambda) - h_b(\mu)| \\
&\le \lambda |H(P') - H(Q')| + |\lambda - \mu| H(Q') + |h_b(\lambda) - h_b(\mu)| \\
&\le (m - 2) \cdot h_b(\epsilon) + (m - 2) \log(m - 2) \cdot \epsilon + \log(m - 1) \cdot \epsilon + h_b(\epsilon) \\
&\le (m - 1) \cdot h_b(\epsilon) + (m - 1) \log(m - 1) \cdot \epsilon .
\end{aligned}$$

■ 
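The bound (94) can be spot-checked numerically. The sketch below is only an illustration under our own assumptions: entropies are in nats, $Q$ is generated by pulling a random distribution toward $P$ so that $\epsilon$ stays below $1/2$, and $\epsilon$ is taken as the largest pointwise difference, which is the smallest value satisfying (93). All function names are hypothetical.

```python
import math
import random

def entropy(p):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(v * math.log(v) for v in p if v > 0)

def binary_entropy(e):
    """Binary entropy h_b(e) in nats."""
    return entropy([e, 1.0 - e])

def lemma6_bound(size, eps):
    """Right-hand side of (94) for an alphabet with `size` symbols."""
    k = size - 1
    return k * binary_entropy(eps) + k * math.log(k) * eps

def random_distribution(size, rng):
    """A random probability vector obtained by normalizing uniform weights."""
    w = [rng.random() for _ in range(size)]
    total = sum(w)
    return [v / total for v in w]

def check_lemma6(size, trials=200, seed=0):
    """Smallest slack, bound minus |H(P) - H(Q)|, over randomly sampled pairs."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        p = random_distribution(size, rng)
        r = random_distribution(size, rng)
        t = 0.1 * rng.random()
        q = [(1 - t) * a + t * b for a, b in zip(p, r)]  # Q is close to P
        eps = max(abs(a - b) for a, b in zip(p, q))       # smallest eps valid in (93)
        slack = lemma6_bound(size, eps) - abs(entropy(p) - entropy(q))
        worst = min(worst, slack)
    return worst

for size in (2, 3, 5):
    print(size, check_lemma6(size))
```

A nonnegative printed slack is consistent with the lemma on the sampled pairs; it does not establish the bound in general.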

Lemma 7: Let $W(y|x)$ and $V(y|x)$ be two channels with finite input and output alphabets $\mathcal{X}$ and $\mathcal{Y}$. If

$$|W(y|x) - V(y|x)| \le \epsilon \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y} , \tag{96}$$

then for any input distribution $P$ on $\mathcal{X}$ we have

$$|I(P, W) - I(P, V)| \le 2(|\mathcal{Y}| - 1) \cdot h_b(\epsilon) + 2(|\mathcal{Y}| - 1) \log(|\mathcal{Y}| - 1) \cdot \epsilon , \tag{97}$$

where $h_b(\cdot)$ is the binary entropy function.

Proof: We simply apply Lemma 6 twice. Let $Q_W$ and $Q_V$ be the marginal distributions on $\mathcal{Y}$ under channels $W$ and $V$ respectively. Then

$$|Q_W(y) - Q_V(y)| \le \sum_x P(x) |W(y|x) - V(y|x)| \le \epsilon .$$

Now we can break apart the mutual information and use Lemma 6 on each term:

$$\begin{aligned}
|I(P, W) - I(P, V)| &\le |H(Q_W) - H(Q_V)| + \sum_x P(x) |H(W(Y|X = x)) - H(V(Y|X = x))| \\
&\le 2(|\mathcal{Y}| - 1) \cdot h_b(\epsilon) + 2(|\mathcal{Y}| - 1) \log(|\mathcal{Y}| - 1) \cdot \epsilon .
\end{aligned}$$

■
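The same kind of spot check applies to (97). In the sketch below, which again uses nats and hypothetical helper names, a channel is a list of rows $W(\cdot|x)$, the second channel $V$ is a small perturbation of $W$, and $\epsilon$ is the largest entrywise difference between the two. These sampling choices are assumptions made for illustration only.

```python
import math
import random

def entropy(p):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information(P, W):
    """I(P, W) = H(output marginal) - sum_x P(x) H(W(.|x)), in nats."""
    outputs = range(len(W[0]))
    marginal = [sum(px * row[y] for px, row in zip(P, W)) for y in outputs]
    return entropy(marginal) - sum(px * entropy(row) for px, row in zip(P, W))

def lemma7_bound(num_outputs, eps):
    """Right-hand side of (97)."""
    k = num_outputs - 1
    return 2 * k * entropy([eps, 1.0 - eps]) + 2 * k * math.log(k) * eps

def random_distribution(size, rng):
    """A random probability vector obtained by normalizing uniform weights."""
    w = [rng.random() for _ in range(size)]
    total = sum(w)
    return [v / total for v in w]

def check_lemma7(num_inputs, num_outputs, trials=200, seed=0):
    """Smallest slack, bound minus |I(P,W) - I(P,V)|, over sampled channel pairs."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        P = random_distribution(num_inputs, rng)
        W = [random_distribution(num_outputs, rng) for _ in range(num_inputs)]
        U = [random_distribution(num_outputs, rng) for _ in range(num_inputs)]
        t = 0.1 * rng.random()
        V = [[(1 - t) * w + t * u for w, u in zip(wr, ur)] for wr, ur in zip(W, U)]
        eps = max(abs(w - v) for wr, vr in zip(W, V) for w, v in zip(wr, vr))
        gap = abs(mutual_information(P, W) - mutual_information(P, V))
        worst = min(worst, lemma7_bound(num_outputs, eps) - gap)
    return worst

print(check_lemma7(num_inputs=3, num_outputs=4))
```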



B. Properties of concatenated fixed composition sets 

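Let $\tau(\mathbf{x})$ be the type of $\mathbf{x}$. Let $T_n(P) = \{\mathbf{x} \in \mathcal{X}^n : \tau(\mathbf{x}) = P\}$ be the set of all length-$n$ vectors of type $P$. For a vector $\mathbf{x}$, let $\mathbf{x}_1^m$ be the first $m$ elements of $\mathbf{x}$.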

Lemma 8: For all finite sets $\mathcal{X}$, and all types $P$ with $p_0 = \min_{x \in \mathcal{X}} P(x) > 0$, there exists $\eta = \eta(P) < \infty$ such that for sufficiently large $n$, for all $M > 0$:

$$\frac{|T_n(P)|^M}{|T_{Mn}(P)|} \ge \exp(-\eta M \log n) . \tag{98}$$

Proof: We begin with the following [24, p. 39]:

$$kH(P) - \frac{|\mathcal{X}| - 1}{2} \log(2\pi k) - \nu_1(P) \le \log |T_k(P)| \le kH(P) - \frac{|\mathcal{X}| - 1}{2} \log(2\pi k) - \nu_2(P) ,$$

for $0 < \nu_1(P) < \infty$ and $0 < \nu_2(P) < \infty$ since $p_x \ge p_0$ for all $x$. From this we can take the ratio:

$$\log \frac{|T_n(P)|^M}{|T_{Mn}(P)|} \ge -M \frac{|\mathcal{X}| - 1}{2} \log(2\pi n) - M \nu_1(P) + \frac{|\mathcal{X}| - 1}{2} \log(2\pi M n) + \nu_2(P) .$$

For fixed $P$ and sufficiently large $n$, this lower bound is $\Omega(M \log n)$, which establishes the result. ■
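To get a feel for the size of $\eta$ in (98), one can compute the type class sizes exactly as multinomial coefficients. The sketch below is an illustration under our own assumptions: it works in nats, uses `math.lgamma` to evaluate log-factorials, and reports the smallest $\eta$ for which (98) holds at a given $n$ and $M$. The example type $P = (1/2, 1/4, 1/4)$ and the function names are hypothetical choices.

```python
import math

def log_type_class_size(counts):
    """log |T_k(P)| for the type with the given symbol counts, k = sum(counts)."""
    k = sum(counts)
    return math.lgamma(k + 1) - sum(math.lgamma(c + 1) for c in counts)

def implied_eta(counts, M):
    """Smallest eta satisfying (98) for this type, with n = sum(counts)."""
    n = sum(counts)
    scaled = [M * c for c in counts]  # same type P at block length M * n
    log_ratio = M * log_type_class_size(counts) - log_type_class_size(scaled)
    return -log_ratio / (M * math.log(n))

base = (2, 1, 1)  # P = (1/2, 1/4, 1/4) at n = 4 (assumed example)
for scale in (2, 20, 200):
    counts = [scale * c for c in base]
    etas = [implied_eta(counts, M) for M in (1, 2, 10, 100)]
    print(sum(counts), [round(e, 3) for e in etas])
```

The implied $\eta$ is nonnegative because concatenating $M$ vectors from $T_n(P)$ always yields a vector in $T_{Mn}(P)$, so $|T_n(P)|^M \le |T_{Mn}(P)|$.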